0%| | 0/2230 [00:00> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:39,147 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:40,443 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:41,108 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:42,285 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:42,990 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:44,149 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:44,801 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:45,938 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:46,547 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:47,691 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:48,338 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:49,480 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:32:50,102 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 10.1925, 'learning_rate': 0.0, 'epoch': 0.0} [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:51,232 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:51,876 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%| | 1/2230 [00:15<9:29:51, 15.34s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:32:53,067 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:53,690 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:54,812 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:55,439 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:56,558 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:57,199 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:32:58,314 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:32:58,956 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:00,056 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:00,659 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:01,807 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:02,432 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:03,552 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:04,170 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 10.3094, 'learning_rate': 0.0, 'epoch': 0.0} [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:05,304 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:05,945 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%| | 2/2230 [00:29<9:02:01, 14.60s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:33:07,118 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:07,756 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:08,850 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:09,455 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:10,580 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:11,217 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:12,338 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:12,975 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:14,102 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:14,739 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:15,837 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:16,452 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:17,537 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:18,174 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 10.0606, 'learning_rate': 6e-07, 'epoch': 0.01} [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:19,261 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:19,875 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%| | 3/2230 [00:43<8:56:35, 14.46s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:33:21,447 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:22,082 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:23,163 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:23,755 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:24,858 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:25,502 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:26,601 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:27,228 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:28,330 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:28,955 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:30,049 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:30,653 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:31,775 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:32,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 10.1729, 'learning_rate': 6e-07, 'epoch': 0.01} [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:33,473 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:34,073 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▏ | 4/2230 [00:57<8:47:10, 14.21s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:33:35,203 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:35,807 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:36,920 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:37,543 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:38,622 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:39,224 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:40,342 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:40,945 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:42,017 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:42,636 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:43,696 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:44,284 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:45,340 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:45,931 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:47,007 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:47,628 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▏ | 5/2230 [01:11<8:38:36, 13.99s/it] 0%|▏ | 5/2230 [01:11<8:38:36, 13.99s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:33:48,797 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:49,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:50,505 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:51,111 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:52,183 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:33:52,802 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:53,877 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:54,498 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:55,566 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:56,166 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:57,235 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:57,837 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:33:58,913 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:33:59,545 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:00,608 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:01,218 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▏ | 6/2230 [01:24<8:33:23, 13.85s/it] 0%|▏ | 6/2230 [01:24<8:33:23, 13.85s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:34:02,382 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:02,986 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:04,116 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:04,747 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:05,840 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:06,459 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:07,524 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:08,102 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:09,161 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:09,758 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:10,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:11,450 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:12,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:13,091 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:14,169 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 10.1223, 'learning_rate': 2.4e-06, 'epoch': 0.02} [WARNING|modeling_utils.py:388] 2022-03-22 16:34:14,757 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▎ | 7/2230 [01:38<8:29:22, 13.75s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:34:15,895 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:16,491 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:17,549 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:18,166 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:19,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:19,853 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:20,890 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:21,491 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:22,543 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:23,136 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:24,198 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:24,807 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:25,874 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:26,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:27,556 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:28,147 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▎ | 8/2230 [01:51<8:24:56, 13.63s/it] 0%|▎ | 8/2230 [01:51<8:24:56, 13.63s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:34:29,282 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:29,872 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:30,929 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:31,559 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:32,603 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:33,207 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:34,278 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:34,857 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:35,909 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:36,486 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:37,527 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:38,134 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:39,214 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:39,821 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 9.0957, 'learning_rate': 3.6e-06, 'epoch': 0.02} [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:40,861 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:41,451 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▎ | 9/2230 [02:04<8:20:51, 13.53s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:34:42,579 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:43,166 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:44,198 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:44,804 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:45,839 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:46,446 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:47,480 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:48,084 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:49,110 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:49,700 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:50,734 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:51,345 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:52,382 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:34:52,991 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:54,011 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:54,604 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▎ | 10/2230 [02:18<8:16:19, 13.41s/it] 0%|▎ | 10/2230 [02:18<8:16:19, 13.41s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:34:55,717 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:56,307 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:57,336 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:57,940 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:34:58,969 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:34:59,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:00,604 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:01,167 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:02,183 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:35:02,753 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:03,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:04,375 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:05,410 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:05,984 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:07,016 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:07,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 8.4451, 'learning_rate': 4.8e-06, 'epoch': 0.02} 0%|▍ | 11/2230 [02:31<8:11:15, 13.28s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:35:08,695 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:09,280 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:10,322 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:10,911 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:11,925 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:35:12,496 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:13,505 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:14,076 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:15,096 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:15,683 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:16,719 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:17,304 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:18,339 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:18,923 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:19,934 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:20,505 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 1%|▍ | 12/2230 [02:44<8:06:54, 13.17s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:35:21,609 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 7.9992, 'learning_rate': 5.399999999999999e-06, 'epoch': 0.03} [WARNING|modeling_utils.py:388] 2022-03-22 16:35:22,200 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:23,221 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:23,818 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:24,826 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:25,406 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:26,420 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:28,972 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:30,008 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:30,598 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:31,606 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:32,175 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:33,179 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:35:33,751 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:34,755 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:35,325 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 1%|▍ | 13/2230 [02:58<8:25:07, 13.67s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:35:36,503 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:37,092 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:38,115 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:38,702 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:39,724 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:40,315 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:41,332 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:41,915 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:42,941 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 16:35:43,525 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:44,530 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:45,099 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:46,092 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:46,660 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:47,658 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:48,226 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 1%|▍ | 14/2230 [03:11<8:16:19, 13.44s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:35:49,313 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:35:49,896 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:52,496 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:35:55,637 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:35:58,798 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▌ | 15/2230 [03:24<8:08:22, 13.23s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:36:02,048 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:05,160 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:08,305 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:36:11,448 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▌ | 16/2230 [03:37<8:01:42, 13.05s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:36:14,712 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:17,818 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:20,940 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:36:24,029 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▌ | 17/2230 [03:49<7:55:41, 12.90s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:36:27,194 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:30,234 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:33,324 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:36:36,365 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▋ | 18/2230 [04:01<7:49:06, 12.72s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:36:39,524 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:42,620 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:45,683 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:48,718 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
1%|▋ | 19/2230 [04:14<7:44:57, 12.62s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:36:51,889 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:54,980 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:36:58,041 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:01,098 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
1%|▋ | 20/2230 [04:26<7:41:28, 12.53s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:37:04,221 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:07,301 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:10,387 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:13,444 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
1%|▋ | 21/2230 [04:39<7:39:23, 12.48s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:37:16,588 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:19,643 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:22,716 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:25,707 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▊ | 22/2230 [04:51<7:36:17, 12.40s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:37:28,764 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 6.0982, 'learning_rate': 1.14e-05, 'epoch': 0.05} [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:31,793 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:37:34,804 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:37,825 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▊ | 23/2230 [05:03<7:32:48, 12.31s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:37:40,866 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:43,879 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:46,879 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:37:49,898 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 1%|▊ | 24/2230 [05:15<7:29:51, 12.24s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:37:52,916 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.8474, 'learning_rate': 1.26e-05, 'epoch': 0.05} [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:55,926 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:37:58,876 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:38:01,839 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
1%|▉ | 25/2230 [05:29<7:47:26, 12.72s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:38:06,835 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.5599, 'learning_rate': 1.3199999999999997e-05, 'epoch': 0.06} [WARNING|modeling_bart.py:1051] 2022-03-22 16:38:09,778 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:38:12,725 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.7598, 'learning_rate': 1.3799999999999998e-05, 'epoch': 0.06} 1%|▉ | 27/2230 [05:52<7:30:59, 12.28s/it]
{'loss': 5.5277, 'learning_rate': 1.44e-05, 'epoch': 0.06} {'loss': 5.2876, 'learning_rate': 1.4999999999999999e-05, 'epoch': 0.06} [WARNING|modeling_utils.py:388] 2022-03-22 16:38:48,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 5.1947, 'learning_rate': 1.5599999999999996e-05, 'epoch': 0.07}
1%|█ | 30/2230 [06:27<7:14:02, 11.84s/it] {'loss': 5.2972, 'learning_rate': 1.68e-05, 'epoch': 0.07}
1%|█▏ | 32/2230 [06:50<7:02:50, 11.54s/it] {'loss': 5.059, 'learning_rate': 1.74e-05, 'epoch': 0.07} {'loss': 5.1491, 'learning_rate': 1.7999999999999997e-05, 'epoch': 0.07}
[WARNING|modeling_utils.py:388] 2022-03-22 16:39:47,304 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.0233, 'learning_rate': 1.8599999999999998e-05, 'epoch': 0.08}
[WARNING|modeling_utils.py:388] 2022-03-22 16:39:57,823 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:40:01,798 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.9906, 'learning_rate': 1.98e-05, 'epoch': 0.08}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:17,970 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
2%|█▎ | 37/2230 [07:43<6:27:40, 10.61s/it]
{'loss': 5.091, 'learning_rate': 2.04e-05, 'epoch': 0.08}
[WARNING|modeling_utils.py:388] 2022-03-22 16:40:30,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:34,634 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:40,444 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:42,820 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 16:40:46,671 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
2%|█▍ | 40/2230 [08:13<6:09:06, 10.11s/it][WARNING|modeling_bart.py:1051] 2022-03-22 16:40:50,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:53,038 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:40:58,389 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:00,503 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:02,539 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:04,535 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:06,555 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:08,566 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:10,423 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:12,277 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:14,105 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:15,972 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:17,703 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:21,083 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:22,766 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:24,341 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:25,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:28,949 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:30,322 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:32,972 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:34,332 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:36,772 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:39,077 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:41,173 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:43,228 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:45,001 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:47,610 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:49,083 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.7624, 'learning_rate': 2.8199999999999998e-05, 'epoch': 0.11}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:53,910 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:41:57,517 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:01,134 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:04,662 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.9032, 'learning_rate': 2.88e-05, 'epoch': 0.11}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:08,325 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:11,830 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:15,316 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:18,795 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.7495, 'learning_rate': 2.94e-05, 'epoch': 0.12}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:22,332 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:25,854 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:29,273 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:32,695 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:36,187 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 16:42:39,547 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.2808, 'learning_rate': 3.06e-05, 'epoch': 0.12}
{'loss': 4.8594, 'learning_rate': 3.119999999999999e-05, 'epoch': 0.12}
{'loss': 4.995, 'learning_rate': 3.1799999999999994e-05, 'epoch': 0.13}
3%|██ | 57/2230 [10:50<7:46:31, 12.88s/it]
{'loss': 5.0311, 'learning_rate': 3.2399999999999995e-05, 'epoch': 0.13}
{'loss': 5.0366, 'learning_rate': 3.2999999999999996e-05, 'epoch': 0.13}
3%|██ | 59/2230 [11:17<7:53:27, 13.08s/it]
{'loss': 4.9435, 'learning_rate': 3.36e-05, 'epoch': 0.13}
{'loss': 4.8193, 'learning_rate': 3.42e-05, 'epoch': 0.13}
{'loss': 4.8357, 'learning_rate': 3.48e-05, 'epoch': 0.14}
{'loss': 4.8625, 'learning_rate': 3.539999999999999e-05, 'epoch': 0.14}
{'loss': 4.7677, 'learning_rate': 3.5999999999999994e-05, 'epoch': 0.14}
{'loss': 4.8096, 'learning_rate': 3.6599999999999995e-05, 'epoch': 0.14}
{'loss': 4.7863, 'learning_rate': 3.7199999999999996e-05, 'epoch': 0.15}
3%|██▎ | 66/2230 [12:49<7:46:55, 12.95s/it]
{'loss': 4.6589, 'learning_rate': 3.78e-05, 'epoch': 0.15}
{'loss': 4.7568, 'learning_rate': 3.84e-05, 'epoch': 0.15}
{'loss': 4.7222, 'learning_rate': 3.9e-05, 'epoch': 0.15}
{'loss': 4.7684, 'learning_rate': 3.96e-05, 'epoch': 0.15}
{'loss': 4.6798, 'learning_rate': 4.02e-05, 'epoch': 0.16}
{'loss': 4.8219, 'learning_rate': 4.08e-05, 'epoch': 0.16}
{'loss': 4.8288, 'learning_rate': 4.14e-05, 'epoch': 0.16}
3%|██▌ | 73/2230 [14:15<7:20:02, 12.24s/it]
{'loss': 4.7101, 'learning_rate': 4.259999999999999e-05, 'epoch': 0.17}
{'loss': 4.7489, 'learning_rate': 4.319999999999999e-05, 'epoch': 0.17}
{'loss': 4.6794, 'learning_rate': 4.3799999999999994e-05, 'epoch': 0.17}
3%|██▋ | 77/2230 [15:03<7:15:10, 12.13s/it]
{'loss': 4.6963, 'learning_rate': 4.4999999999999996e-05, 'epoch': 0.17}
4%|██▊ | 79/2230 [15:26<7:00:56, 11.74s/it]
{'loss': 4.6342, 'learning_rate': 4.56e-05, 'epoch': 0.18}
{'loss': 4.7477, 'learning_rate': 4.62e-05, 'epoch': 0.18}
[WARNING|modeling_utils.py:388] 2022-03-22 16:48:25,519 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.7116, 'learning_rate': 4.68e-05, 'epoch': 0.18}
[WARNING|modeling_utils.py:388] 2022-03-22 16:48:37,959 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:48:42,048 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:48:46,041 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 4.7662, 'learning_rate': 4.7999999999999994e-05, 'epoch': 0.19}
[WARNING|modeling_utils.py:388] 2022-03-22 16:48:57,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 4.668, 'learning_rate': 4.8599999999999995e-05, 'epoch': 0.19} [WARNING|modeling_utils.py:388] 2022-03-22 16:49:08,272 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.7007, 'learning_rate': 4.9199999999999997e-05, 'epoch': 0.19} [WARNING|modeling_bart.py:1051] 2022-03-22 16:49:16,703 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 4%|███ | 86/2230 [16:41<6:18:49, 10.60s/it] {'loss': 4.6541, 'learning_rate': 4.98e-05, 'epoch': 0.19}
[WARNING|modeling_utils.py:388] 2022-03-22 16:49:24,671 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 4%|███ | 87/2230 [16:51<6:11:12, 10.39s/it] [WARNING|modeling_utils.py:388] 2022-03-22 16:49:30,781 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:49:41,040 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 16:49:45,264 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:49:49,139 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6933, 'learning_rate': 5.1599999999999994e-05, 'epoch': 0.2} [WARNING|modeling_bart.py:1051] 2022-03-22 16:49:53,292 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 16:49:55,533 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:49:59,257 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:01,399 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:03,496 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:05,514 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:07,599 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:09,579 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:11,506 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:13,386 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:15,305 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:17,138 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:18,940 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:20,679 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:22,508 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:24,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:27,361 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:29,018 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:30,515 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:33,446 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:34,935 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:37,619 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:38,932 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:41,495 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:43,765 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:46,007 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:48,044 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:49,093 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:50,907 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:50:53,588 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:50:55,095 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 4.412, 'learning_rate': 5.82e-05, 'epoch': 0.22} [WARNING|modeling_utils.py:388] 2022-03-22 16:50:58,800 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:02,431 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:05,930 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:51:09,451 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 6.1576, 'learning_rate': 5.88e-05, 'epoch': 0.23} [WARNING|modeling_utils.py:388] 2022-03-22 16:51:13,016 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:16,524 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:20,002 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:51:23,439 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 5.8415, 'learning_rate': 5.94e-05, 'epoch': 0.23} [WARNING|modeling_utils.py:388] 2022-03-22 16:51:26,974 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:30,430 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:33,851 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:51:37,266 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:40,693 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:44,002 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 16:51:47,432 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:51:50,808 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 5.1346, 'learning_rate': 6.0599999999999996e-05, 'epoch': 0.23}
{'loss': 4.858, 'learning_rate': 6.12e-05, 'epoch': 0.24}
5%|███▋ | 106/2230 [19:42<7:24:59, 12.57s/it]
{'loss': 4.7614, 'learning_rate': 6.18e-05, 'epoch': 0.24}
{'loss': 4.7545, 'learning_rate': 6.239999999999999e-05, 'epoch': 0.24}
5%|███▊ | 108/2230 [20:09<7:36:31, 12.91s/it]
{'loss': 4.6803, 'learning_rate': 6.299999999999999e-05, 'epoch': 0.24}
{'loss': 4.6851, 'learning_rate': 6.359999999999999e-05, 'epoch': 0.24}
5%|███▊ | 110/2230 [20:35<7:41:21, 13.06s/it]
{'loss': 4.7716, 'learning_rate': 6.419999999999999e-05, 'epoch': 0.25}
{'loss': 4.7469, 'learning_rate': 6.479999999999999e-05, 'epoch': 0.25}
{'loss': 4.7064, 'learning_rate': 6.539999999999999e-05, 'epoch': 0.25}
{'loss': 4.6541, 'learning_rate': 6.599999999999999e-05, 'epoch': 0.25}
5%|███▉ | 114/2230 [21:28<7:47:25, 13.25s/it]
{'loss': 4.5913, 'learning_rate': 6.659999999999999e-05, 'epoch': 0.26}
{'loss': 4.577, 'learning_rate': 6.72e-05, 'epoch': 0.26}
{'loss': 4.6932, 'learning_rate': 6.78e-05, 'epoch': 0.26}
{'loss': 4.6003, 'learning_rate': 6.84e-05, 'epoch': 0.26}
{'loss': 4.6862, 'learning_rate': 6.9e-05, 'epoch': 0.26}
{'loss': 4.6215, 'learning_rate': 6.96e-05, 'epoch': 0.27}
{'loss': 4.6836, 'learning_rate': 7.02e-05, 'epoch': 0.27}
{'loss': 4.4829, 'learning_rate': 7.079999999999999e-05, 'epoch': 0.27}
{'loss': 4.5853, 'learning_rate': 7.139999999999999e-05, 'epoch': 0.27}
{'loss': 4.5679, 'learning_rate': 7.199999999999999e-05, 'epoch': 0.28}
{'loss': 4.5366, 'learning_rate': 7.259999999999999e-05, 'epoch': 0.28}
{'loss': 4.6412, 'learning_rate': 7.319999999999999e-05, 'epoch': 0.28}
{'loss': 4.6462, 'learning_rate': 7.379999999999999e-05, 'epoch': 0.28}
{'loss': 4.5981, 'learning_rate': 7.439999999999999e-05, 'epoch': 0.28}
6%|████▍ | 128/2230 [24:20<6:56:54, 11.90s/it]
{'loss': 4.547, 'learning_rate': 7.5e-05, 'epoch': 0.29}
{'loss': 4.4812, 'learning_rate': 7.56e-05, 'epoch': 0.29}
6%|████▌ | 130/2230 [24:43<6:47:04, 11.63s/it]
{'loss': 4.635, 'learning_rate': 7.62e-05, 'epoch': 0.29}
[WARNING|modeling_utils.py:388] 2022-03-22 16:57:32,167 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 16:57:38,512 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.5837, 'learning_rate': 7.74e-05, 'epoch': 0.3}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:57:50,804 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.4777, 'learning_rate': 7.8e-05, 'epoch': 0.3}
{'loss': 4.5206, 'learning_rate': 7.86e-05, 'epoch': 0.3}
[WARNING|modeling_utils.py:388] 2022-03-22 16:58:10,918 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6975, 'learning_rate': 7.92e-05, 'epoch': 0.3}
[WARNING|modeling_bart.py:1051] 2022-03-22 16:58:19,367 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.5397, 'learning_rate': 7.98e-05, 'epoch': 0.3}
[WARNING|modeling_utils.py:388] 2022-03-22 16:58:27,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 16:58:31,838 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
6%|████▊ | 137/2230 [25:57<6:02:55, 10.40s/it]
{'loss': 4.5827, 'learning_rate': 8.04e-05, 'epoch': 0.31}
6%|████▊ | 138/2230 [26:08<6:12:47, 10.69s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 16:58:45,751 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.6354, 'learning_rate': 8.1e-05, 'epoch': 0.31}
[WARNING|modeling_utils.py:388] 2022-03-22 16:58:49,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:58:51,910 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6395, 'learning_rate': 8.16e-05, 'epoch': 0.31}
[WARNING|modeling_utils.py:388] 2022-03-22 16:58:57,622 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 16:59:01,641 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
6%|████▉ | 140/2230 [26:26<5:44:06, 9.88s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.654, 'learning_rate': 8.22e-05, 'epoch': 0.31}
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:07,496 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:09,535 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:11,547 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:13,600 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:15,512 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:17,435 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:19,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:19,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:21,179 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:23,004 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:26,480 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:26,480 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:28,237 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:29,870 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:31,459 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:31,459 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:34,681 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:36,207 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:37,691 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:37,691 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:40,566 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:41,892 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:44,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:44,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:46,836 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:47,949 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:47,949 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:50,222 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:52,231 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:52,231 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:54,188 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 16:59:56,805 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:56,805 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:58,521 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 16:59:59,250 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:01,447 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:01,447 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4768, 'learning_rate': 8.819999999999999e-05, 'epoch': 0.34} [WARNING|modeling_utils.py:388] 2022-03-22 17:00:05,271 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:00:08,799 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:08,799 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:12,336 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:12,336 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:12,336 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:15,792 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:19,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:00:19,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:22,775 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:22,775 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:26,193 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:29,654 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:29,654 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.2105, 'learning_rate': 8.939999999999999e-05, 'epoch': 0.34} [WARNING|modeling_utils.py:388] 2022-03-22 17:00:33,191 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:00:33,191 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:36,590 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:39,989 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:39,989 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:43,432 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:43,432 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.0365, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.34} [WARNING|modeling_utils.py:388] 2022-03-22 17:00:46,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:00:46,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:50,247 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:53,609 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:53,609 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:56,920 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:00:56,920 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:01:00,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:01:00,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:01:00,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:01:00,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:01:00,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.814, 'learning_rate': 9.12e-05, 'epoch': 0.35} 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.6306, 'learning_rate': 9.18e-05, 'epoch': 0.35} 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 7%|█████▍ | 155/2230 [28:33<7:00:29, 12.16s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 4.5816, 'learning_rate': 9.24e-05, 'epoch': 0.35}
{'loss': 4.6355, 'learning_rate': 9.3e-05, 'epoch': 0.35}
{'loss': 4.6574, 'learning_rate': 9.36e-05, 'epoch': 0.36}
{'loss': 4.6793, 'learning_rate': 9.419999999999999e-05, 'epoch': 0.36}
{'loss': 4.6908, 'learning_rate': 9.479999999999999e-05, 'epoch': 0.36}
{'loss': 4.6475, 'learning_rate': 9.539999999999999e-05, 'epoch': 0.36}
{'loss': 4.5892, 'learning_rate': 9.599999999999999e-05, 'epoch': 0.37}
{'loss': 4.6243, 'learning_rate': 9.659999999999999e-05, 'epoch': 0.37}
{'loss': 4.4789, 'learning_rate': 9.719999999999999e-05, 'epoch': 0.37}
7%|█████▊ | 166/2230 [30:58<7:24:53, 12.93s/it]
{'loss': 4.5289, 'learning_rate': 9.779999999999999e-05, 'epoch': 0.37}
7%|█████▊ | 167/2230 [31:10<7:20:40, 12.82s/it]
{'loss': 4.4315, 'learning_rate': 9.839999999999999e-05, 'epoch': 0.37}
8%|█████▉ | 168/2230 [31:23<7:16:38, 12.71s/it]
{'loss': 4.505, 'learning_rate': 9.9e-05, 'epoch': 0.38}
8%|█████▉ | 169/2230 [31:35<7:11:51, 12.57s/it]
{'loss': 4.5744, 'learning_rate': 9.96e-05, 'epoch': 0.38}
8%|█████▉ | 170/2230 [31:47<7:08:10, 12.47s/it]
{'loss': 4.5639, 'learning_rate': 0.0001002, 'epoch': 0.38}
8%|█████▉ | 171/2230 [32:00<7:05:32, 12.40s/it]
{'loss': 4.563, 'learning_rate': 0.0001008, 'epoch': 0.38}
{'loss': 4.6649, 'learning_rate': 0.0001014, 'epoch': 0.39}
8%|██████ | 173/2230 [32:24<7:00:36, 12.27s/it]
{'loss': 4.5898, 'learning_rate': 0.0001026, 'epoch': 0.39}
{'loss': 4.6714, 'learning_rate': 0.00010319999999999999, 'epoch': 0.39}
8%|██████▏ | 176/2230 [33:01<7:04:09, 12.39s/it]
{'loss': 4.4916, 'learning_rate': 0.00010379999999999999, 'epoch': 0.39}
{'loss': 4.4783, 'learning_rate': 0.00010439999999999999, 'epoch': 0.4}
[WARNING|modeling_utils.py:388] 2022-03-22 17:05:57,333 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.549, 'learning_rate': 0.00010499999999999999, 'epoch': 0.4}
8%|██████▎ | 179/2230 [33:36<6:42:37, 11.78s/it]
{'loss': 4.6129, 'learning_rate': 0.00010559999999999998, 'epoch': 0.4}
{'loss': 4.6554, 'learning_rate': 0.00010619999999999998, 'epoch': 0.4}
[WARNING|modeling_utils.py:388] 2022-03-22 17:06:29,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:06:33,958 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.5494, 'learning_rate': 0.00010679999999999998, 'epoch': 0.41}
{'loss': 4.5649, 'learning_rate': 0.00010739999999999998, 'epoch': 0.41}
[WARNING|modeling_utils.py:388] 2022-03-22 17:06:51,982 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:06:56,075 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:07:00,088 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:07:00,088 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:00,088 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▍ | 184/2230 [34:31<6:16:30, 11.04s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▍ | 184/2230 [34:31<6:16:30, 11.04s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:10,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:10,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:10,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:10,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 8%|██████▍ | 185/2230 [34:41<6:09:00, 10.83s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▍ | 185/2230 [34:41<6:09:00, 10.83s/it]g-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:20,921 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:20,921 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:20,921 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:20,921 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:07:20,921 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 16:59:03,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▌ | 186/2230 [34:51<6:02:22, 10.64s/it][WARNING|modeling_bart.py:1051] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
8%|██████▌ | 186/2230 [34:51<6:02:22, 10.64s/it][WARNING|modeling_bart.py:1051] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▌ | 186/2230 [34:51<6:02:22, 10.64s/it][WARNING|modeling_bart.py:1051] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4747, 'learning_rate': 0.00011039999999999999, 'epoch': 0.42} [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:35,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:49,398 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:49,398 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4546, 'learning_rate': 0.00011099999999999999, 'epoch': 0.42} [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:49,398 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:55,376 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:57,678 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:07:57,678 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 8%|██████▌ | 189/2230 [35:22<5:51:19, 10.33s/it] Setting `use_cache=False`...1] 2022-03-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:08:01,583 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:08:03,817 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:08:05,983 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:08:05,983 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:08:05,983 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:08:09,963 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:08:12,071 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:08:14,139 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:08:16,157 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.5184, 'learning_rate': 0.00011279999999999999, 'epoch': 0.43}
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:19,646 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:21,628 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:23,560 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:25,570 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:27,428 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:29,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:31,121 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:32,942 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:34,666 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:38,060 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:39,829 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:41,392 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:42,979 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:46,011 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:47,402 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:50,054 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:51,422 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:53,869 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:56,249 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:08:58,313 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:00,358 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:02,146 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:03,956 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:06,365 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:10,204 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:13,831 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:17,388 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:20,928 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:24,452 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:27,904 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:31,352 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:34,766 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:38,282 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:41,694 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:45,074 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:48,460 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.3307, 'learning_rate': 0.00011999999999999999, 'epoch': 0.46}
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:51,870 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:55,181 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.0387, 'learning_rate': 0.00012059999999999999, 'epoch': 0.46}
{'loss': 4.969, 'learning_rate': 0.00012119999999999999, 'epoch': 0.46}
{'loss': 4.8809, 'learning_rate': 0.00012179999999999999, 'epoch': 0.46}
{'loss': 4.6725, 'learning_rate': 0.0001224, 'epoch': 0.46}
{'loss': 4.7561, 'learning_rate': 0.00012299999999999998, 'epoch': 0.47}
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.7825, 'learning_rate': 0.0001236, 'epoch': 0.47} [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.7099, 'learning_rate': 0.00012419999999999998, 'epoch': 0.47} [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.5271, 'learning_rate': 0.00012479999999999997, 'epoch': 0.47} [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.642, 'learning_rate': 0.00012539999999999999, 'epoch': 0.48} [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:09:58,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:07:29,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 4.4652, 'learning_rate': 0.00012599999999999997, 'epoch': 0.48}
10%|███████▍ | 214/2230 [39:39<7:22:42, 13.18s/it]
{'loss': 4.5103, 'learning_rate': 0.0001266, 'epoch': 0.48}
10%|███████▌ | 215/2230 [39:51<7:16:16, 12.99s/it]
{'loss': 4.5699, 'learning_rate': 0.00012719999999999997, 'epoch': 0.48}
10%|███████▌ | 216/2230 [40:04<7:10:42, 12.83s/it]
{'loss': 4.5555, 'learning_rate': 0.0001278, 'epoch': 0.48}
{'loss': 4.5327, 'learning_rate': 0.00012839999999999998, 'epoch': 0.49}
{'loss': 4.5224, 'learning_rate': 0.000129, 'epoch': 0.49}
{'loss': 4.4599, 'learning_rate': 0.00012959999999999998, 'epoch': 0.49}
{'loss': 4.5868, 'learning_rate': 0.0001302, 'epoch': 0.49}
{'loss': 4.4898, 'learning_rate': 0.00013079999999999998, 'epoch': 0.5}
{'loss': 4.5415, 'learning_rate': 0.0001314, 'epoch': 0.5}
10%|███████▊ | 223/2230 [41:30<6:47:44, 12.19s/it]
{'loss': 4.5053, 'learning_rate': 0.00013199999999999998, 'epoch': 0.5}
{'loss': 4.4692, 'learning_rate': 0.0001326, 'epoch': 0.5}
[WARNING|modeling_utils.py:388] 2022-03-22 17:14:27,529 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6647, 'learning_rate': 0.00013319999999999999, 'epoch': 0.5}
[WARNING|modeling_utils.py:388] 2022-03-22 17:14:39,631 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.5009, 'learning_rate': 0.0001338, 'epoch': 0.51}
10%|███████▉ | 227/2230 [42:19<6:44:10, 12.11s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:04,297 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.5966, 'learning_rate': 0.000135, 'epoch': 0.51}
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:04,297 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
10%|████████ | 229/2230 [42:42<6:34:10, 11.82s/it]
{'loss': 4.5119, 'learning_rate': 0.0001356, 'epoch': 0.51}
10%|████████ | 229/2230 [42:42<6:34:10, 11.82s/it]
{'loss': 4.528, 'learning_rate': 0.0001362, 'epoch': 0.52}
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:32,980 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:37,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:41,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6159, 'learning_rate': 0.0001368, 'epoch': 0.52}
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:41,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4335, 'learning_rate': 0.0001374, 'epoch': 0.52}
[WARNING|modeling_utils.py:388] 2022-03-22 17:15:41,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:16:01,563 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
10%|████████▏ | 233/2230 [43:26<6:15:39, 11.29s/it]
{'loss': 4.452, 'learning_rate': 0.000138, 'epoch': 0.52}
10%|████████▏ | 233/2230 [43:26<6:15:39, 11.29s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:15,477 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:25,965 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:36,279 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:46,416 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.5933, 'learning_rate': 0.0001404, 'epoch': 0.53}
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:46,416 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:56,615 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4843, 'learning_rate': 0.00014099999999999998, 'epoch': 0.53}
[WARNING|modeling_utils.py:388] 2022-03-22 17:16:56,615 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:17:02,687 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
11%|████████▎ | 239/2230 [44:29<5:47:13, 10.46s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.4572, 'learning_rate': 0.00014159999999999997, 'epoch': 0.54}
[WARNING|modeling_utils.py:388] 2022-03-22 17:17:10,822 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:17:13,090 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:17:15,301 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.528, 'learning_rate': 0.0001422, 'epoch': 0.54}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:19,303 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:21,405 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:23,523 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:25,617 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:27,575 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:29,544 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:31,444 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:33,370 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:35,272 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:37,058 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:38,816 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:40,594 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:43,975 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:45,597 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:47,226 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:50,227 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:51,663 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:54,458 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:55,772 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:58,393 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:17:59,573 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:01,825 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:04,038 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:05,985 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:07,880 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:09,667 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:12,124 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:12,830 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:12,830 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.3235, 'learning_rate': 0.0001482, 'epoch': 0.56} [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:17,550 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:17,550 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:21,142 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:24,673 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:24,673 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:28,227 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:28,227 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.7329, 'learning_rate': 0.00014879999999999998, 'epoch': 0.56} [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:31,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:31,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:35,309 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:38,735 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:38,735 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:42,197 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:42,197 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 5.6422, 'learning_rate': 0.0001494, 'epoch': 0.57} [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:45,756 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:45,756 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:49,175 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:52,573 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:18:52,573 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:55,951 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.3057, 'learning_rate': 0.00015, 'epoch': 0.57}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:18:59,456 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:19:02,818 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:19:06,192 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:19:09,539 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.933, 'learning_rate': 0.00015059999999999997, 'epoch': 0.57}
{'loss': 4.8009, 'learning_rate': 0.0001512, 'epoch': 0.57}
{'loss': 4.7696, 'learning_rate': 0.00015179999999999998, 'epoch': 0.57}
{'loss': 4.6338, 'learning_rate': 0.0001524, 'epoch': 0.58}
{'loss': 4.5604, 'learning_rate': 0.00015299999999999998, 'epoch': 0.58}
12%|█████████ | 259/2230 [47:40<7:06:07, 12.97s/it]
{'loss': 4.6174, 'learning_rate': 0.0001536, 'epoch': 0.58}
{'loss': 4.5476, 'learning_rate': 0.00015419999999999998, 'epoch': 0.58}
{'loss': 4.6667, 'learning_rate': 0.0001548, 'epoch': 0.59}
12%|█████████▏ | 262/2230 [48:19<7:05:23, 12.97s/it]
{'loss': 4.5453, 'learning_rate': 0.00015539999999999998, 'epoch': 0.59}
{'loss': 4.684, 'learning_rate': 0.000156, 'epoch': 0.59}
{'loss': 4.5324, 'learning_rate': 0.00015659999999999998, 'epoch': 0.59}
{'loss': 4.5577, 'learning_rate': 0.0001572, 'epoch': 0.59}
{'loss': 4.492, 'learning_rate': 0.0001578, 'epoch': 0.6}
12%|█████████▎ | 267/2230 [49:24<6:57:57, 12.77s/it]
{'loss': 4.6238, 'learning_rate': 0.0001584, 'epoch': 0.6}
12%|█████████▎ | 268/2230 [49:36<6:53:50, 12.66s/it]
{'loss': 4.4791, 'learning_rate': 0.000159, 'epoch': 0.6}
12%|█████████▍ | 269/2230 [49:49<6:51:48, 12.60s/it]
{'loss': 4.6115, 'learning_rate': 0.0001596, 'epoch': 0.6}
12%|█████████▍ | 270/2230 [50:01<6:49:11, 12.53s/it]
12%|█████████▍ | 271/2230 [50:13<6:45:42, 12.43s/it]
12%|█████████▌ | 272/2230 [50:25<6:41:40, 12.31s/it]
{'loss': 4.3894, 'learning_rate': 0.000162, 'epoch': 0.61}
{'loss': 4.448, 'learning_rate': 0.0001626, 'epoch': 0.61}
{'loss': 4.5522, 'learning_rate': 0.0001632, 'epoch': 0.62}
12%|█████████▋ | 276/2230 [51:15<6:42:18, 12.35s/it]
{'loss': 4.4934, 'learning_rate': 0.0001638, 'epoch': 0.62}
{'loss': 4.4612, 'learning_rate': 0.0001644, 'epoch': 0.62}
{'loss': 4.4943, 'learning_rate': 0.000165, 'epoch': 0.62}
[WARNING|modeling_utils.py:388] 2022-03-22 17:24:20,688 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
13%|█████████▊ | 279/2230 [51:49<6:24:41, 11.83s/it]
{'loss': 4.5346, 'learning_rate': 0.0001656, 'epoch': 0.63}
{'loss': 4.6029, 'learning_rate': 0.0001662, 'epoch': 0.63}
[WARNING|modeling_utils.py:388] 2022-03-22 17:24:43,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
13%|█████████▊ | 281/2230 [52:12<6:15:13, 11.55s/it]
{'loss': 4.5479, 'learning_rate': 0.0001668, 'epoch': 0.63}
[WARNING|modeling_utils.py:388] 2022-03-22 17:24:57,344 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.3756, 'learning_rate': 0.0001674, 'epoch': 0.63}
[WARNING|modeling_utils.py:388] 2022-03-22 17:25:09,677 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:25:13,737 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
13%|█████████▉ | 284/2230 [52:45<6:00:00, 11.10s/it]
{'loss': 4.4843, 'learning_rate': 0.0001686, 'epoch': 0.64}
[WARNING|modeling_utils.py:388] 2022-03-22 17:25:25,692 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.5724, 'learning_rate': 0.00016919999999999997, 'epoch': 0.64}
{'loss': 4.476, 'learning_rate': 0.00016979999999999998, 'epoch': 0.64}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:25:46,087 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 17:25:50,328 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:25:56,603 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:26:02,807 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
13%|██████████ | 288/2230 [53:28<5:53:10, 10.91s/it]
{'loss': 4.496, 'learning_rate': 0.00017099999999999998, 'epoch': 0.65}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:26:08,937 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 17:26:12,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:26:12,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4438, 'learning_rate': 0.00017159999999999997, 'epoch': 0.65} [WARNING|modeling_utils.py:388] 2022-03-22 17:26:16,368 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:26:18,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:26:20,823 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:26:23,021 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:26:23,021 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4612, 'learning_rate': 0.00017219999999999998, 'epoch': 0.65} [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:27,037 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:29,111 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:31,147 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:31,147 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:33,244 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:35,194 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:37,132 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:39,006 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:39,006 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:40,938 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:42,764 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:44,568 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:44,568 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:46,332 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:48,146 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:51,376 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:52,931 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:52,931 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:54,561 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:57,477 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:57,477 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:26:58,870 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:01,579 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:02,829 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:02,829 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:05,326 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:07,612 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:07,612 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:09,886 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:11,923 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:11,923 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:13,909 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:15,686 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:15,686 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:17,438 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:19,690 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:19,690 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:19,690 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4758, 'learning_rate': 0.00017819999999999997, 'epoch': 0.67} [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:24,451 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:24,451 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:28,019 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:31,548 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:27:31,548 >> `use_cache=True` is incompatible with gradient checkpointing. 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:35,056 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.5599, 'learning_rate': 0.00017879999999999998, 'epoch': 0.67}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:38,631 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:42,127 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:45,545 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:48,986 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:52,511 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:55,893 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:27:59,290 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:28:02,783 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:28:06,232 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:28:09,603 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:28:12,942 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:28:16,268 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.996, 'learning_rate': 0.00018059999999999997, 'epoch': 0.68}
{'loss': 4.8184, 'learning_rate': 0.00018119999999999999, 'epoch': 0.68}
{'loss': 4.6625, 'learning_rate': 0.00018179999999999997, 'epoch': 0.69}
{'loss': 4.666, 'learning_rate': 0.0001824, 'epoch': 0.69}
14%|██████████▊ | 308/2230 [56:33<6:53:23, 12.90s/it]
{'loss': 4.6589, 'learning_rate': 0.00018299999999999998, 'epoch': 0.69}
{'loss': 4.641, 'learning_rate': 0.0001836, 'epoch': 0.69}
{'loss': 4.7132, 'learning_rate': 0.00018419999999999998, 'epoch': 0.7}
14%|██████████▉ | 311/2230 [57:12<6:52:43, 12.90s/it]
{'loss': 4.5677, 'learning_rate': 0.00018539999999999998, 'epoch': 0.7}
{'loss': 4.5749, 'learning_rate': 0.000186, 'epoch': 0.7}
{'loss': 4.4865, 'learning_rate': 0.00018659999999999998, 'epoch': 0.7}
{'loss': 4.5187, 'learning_rate': 0.0001872, 'epoch': 0.71}
14%|███████████ | 316/2230 [58:17<6:50:44, 12.88s/it]
{'loss': 4.4829, 'learning_rate': 0.00018779999999999998, 'epoch': 0.71}
{'loss': 4.4654, 'learning_rate': 0.00018839999999999997, 'epoch': 0.71}
{'loss': 4.4546, 'learning_rate': 0.00018899999999999999, 'epoch': 0.71}
14%|███████████▏ | 319/2230 [58:55<6:39:30, 12.54s/it]
{'loss': 4.471, 'learning_rate': 0.00018959999999999997, 'epoch': 0.72}
{'loss': 4.5822, 'learning_rate': 0.0001902, 'epoch': 0.72}
14%|███████████▏ | 321/2230 [59:19<6:34:21, 12.39s/it]
{'loss': 4.4716, 'learning_rate': 0.00019079999999999998, 'epoch': 0.72}
14%|███████████▎ | 322/2230 [59:31<6:32:04, 12.33s/it]
{'loss': 4.3796, 'learning_rate': 0.0001914, 'epoch': 0.72}
{'loss': 4.477, 'learning_rate': 0.00019199999999999998, 'epoch': 0.72}
{'loss': 4.3998, 'learning_rate': 0.0001926, 'epoch': 0.73}
{'loss': 4.5151, 'learning_rate': 0.00019319999999999998, 'epoch': 0.73}
{'loss': 4.5013, 'learning_rate': 0.0001938, 'epoch': 0.73}
[WARNING|modeling_utils.py:388] 2022-03-22 17:33:09,686 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.3593, 'learning_rate': 0.000195, 'epoch': 0.74}
[WARNING|modeling_utils.py:388] 2022-03-22 17:33:24,217 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4115, 'learning_rate': 0.00019559999999999998, 'epoch': 0.74}
[WARNING|modeling_utils.py:388] 2022-03-22 17:33:38,604 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
15%|███████████▏ | 330/2230 [1:01:07<6:12:18, 11.76s/it]
{'loss': 4.5211, 'learning_rate': 0.00019679999999999999, 'epoch': 0.74}
15%|███████████▎ | 332/2230 [1:01:30<6:03:14, 11.48s/it]
{'loss': 4.3388, 'learning_rate': 0.0001974, 'epoch': 0.74}
{'loss': 4.5557, 'learning_rate': 0.000198, 'epoch': 0.75}
[WARNING|modeling_utils.py:388] 2022-03-22 17:34:27,374 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:34:31,455 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
15%|███████████▍ | 335/2230 [1:02:02<5:49:16, 11.06s/it]
{'loss': 4.5496, 'learning_rate': 0.0001992, 'epoch': 0.75}
[WARNING|modeling_utils.py:388] 2022-03-22 17:34:47,324 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4278, 'learning_rate': 0.0001998, 'epoch': 0.75}
[WARNING|modeling_utils.py:388] 2022-03-22 17:34:53,891 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4312, 'learning_rate': 0.0002004, 'epoch': 0.76}
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:12,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4803, 'learning_rate': 0.000201, 'epoch': 0.76}
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:18,214 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
15%|███████████▌ | 339/2230 [1:02:45<5:36:11, 10.67s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:24,252 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:35:28,456 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:35:30,726 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.5558, 'learning_rate': 0.0002022, 'epoch': 0.76}
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:34,645 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:36,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:39,069 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:35:43,072 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:35:45,137 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:35:47,155 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
15%|███████████▋ | 342/2230 [1:03:12<4:56:41, 9.43s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:50,583 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:52,490 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:54,346 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:56,188 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:58,055 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:35:59,824 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:03,155 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:04,838 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:06,442 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:07,958 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:11,045 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:12,455 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:15,158 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:17,810 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:19,029 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:21,430 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:21,430 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:22,518 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:24,595 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:24,595 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:27,453 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:29,202 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:29,202 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:30,711 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:32,819 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:32,819 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.0596, 'learning_rate': 0.00020819999999999996, 'epoch': 0.78} [WARNING|modeling_utils.py:388] 2022-03-22 17:36:36,547 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:36,547 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:36:40,072 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:43,585 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:47,048 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:50,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:54,088 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:36:57,606 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:01,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:04,572 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:08,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:11,374 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:14,775 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:18,185 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:37:21,581 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.6417, 'learning_rate': 0.00021059999999999997, 'epoch': 0.79}
{'loss': 4.7513, 'learning_rate': 0.00021119999999999996, 'epoch': 0.8}
{'loss': 4.7933, 'learning_rate': 0.00021179999999999997, 'epoch': 0.8}
{'loss': 4.6334, 'learning_rate': 0.00021239999999999996, 'epoch': 0.8}
{'loss': 4.6693, 'learning_rate': 0.00021299999999999997, 'epoch': 0.8}
{'loss': 4.5336, 'learning_rate': 0.00021359999999999996, 'epoch': 0.8}
{'loss': 4.5268, 'learning_rate': 0.00021419999999999998, 'epoch': 0.81}
16%|████████████▎ | 361/2230 [1:06:24<6:43:39, 12.96s/it]
{'loss': 4.4647, 'learning_rate': 0.00021479999999999996, 'epoch': 0.81}
{'loss': 4.4749, 'learning_rate': 0.00021539999999999998, 'epoch': 0.81}
{'loss': 4.4175, 'learning_rate': 0.00021599999999999996, 'epoch': 0.81}
{'loss': 4.4526, 'learning_rate': 0.00021659999999999998, 'epoch': 0.82}
{'loss': 4.5824, 'learning_rate': 0.00021719999999999997, 'epoch': 0.82}
{'loss': 4.3921, 'learning_rate': 0.00021779999999999998, 'epoch': 0.82}
{'loss': 4.5027, 'learning_rate': 0.00021839999999999997, 'epoch': 0.82}
{'loss': 4.4884, 'learning_rate': 0.00021899999999999998, 'epoch': 0.83}
{'loss': 4.4069, 'learning_rate': 0.00021959999999999997, 'epoch': 0.83}
{'loss': 4.3835, 'learning_rate': 0.00022019999999999999, 'epoch': 0.83}
17%|████████████▋ | 371/2230 [1:08:32<6:27:24, 12.50s/it]
{'loss': 4.4292, 'learning_rate': 0.0002214, 'epoch': 0.83}
{'loss': 4.3984, 'learning_rate': 0.00022199999999999998, 'epoch': 0.84}
{'loss': 4.2254, 'learning_rate': 0.0002226, 'epoch': 0.84}
{'loss': 4.4513, 'learning_rate': 0.00022319999999999998, 'epoch': 0.84}
[WARNING|modeling_utils.py:388] 2022-03-22 17:42:10,716 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.3402, 'learning_rate': 0.0002238, 'epoch': 0.84}
{'loss': 4.4062, 'learning_rate': 0.00022439999999999998, 'epoch': 0.85}
{'loss': 4.3849, 'learning_rate': 0.000225, 'epoch': 0.85}
17%|████████████▉ | 379/2230 [1:10:08<6:05:46, 11.86s/it]
{'loss': 4.2675, 'learning_rate': 0.00022559999999999998, 'epoch': 0.85}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:42:52,022 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.3187, 'learning_rate': 0.00022619999999999997, 'epoch': 0.85}
[WARNING|modeling_utils.py:388] 2022-03-22 17:43:06,546 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.3882, 'learning_rate': 0.00022679999999999998, 'epoch': 0.85}
[WARNING|modeling_utils.py:388] 2022-03-22 17:43:16,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.426, 'learning_rate': 0.00022739999999999997, 'epoch': 0.86}
17%|█████████████ | 383/2230 [1:10:53<5:48:45, 11.33s/it]
{'loss': 4.3767, 'learning_rate': 0.00022799999999999999, 'epoch': 0.86}
[WARNING|modeling_utils.py:388] 2022-03-22 17:43:36,962 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
17%|█████████████ | 384/2230 [1:11:04<5:43:15, 11.16s/it]
{'loss': 4.3028, 'learning_rate': 0.00022859999999999997, 'epoch': 0.86}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:43:53,570 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 17:44:01,632 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.2933, 'learning_rate': 0.00022979999999999997, 'epoch': 0.87}
[WARNING|modeling_utils.py:388] 2022-03-22 17:44:11,744 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4037, 'learning_rate': 0.0002304, 'epoch': 0.87}
17%|█████████████▏ | 388/2230 [1:11:46<5:33:03, 10.85s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:44:25,889 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:25,889 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:44:30,232 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:44:30,232 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:44:30,232 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:34,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:34,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:44:38,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:44:38,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:42,227 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:42,227 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:44,516 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:46,707 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:44:46,707 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 17:44:50,617 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:44:52,811 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 17:44:56,324 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:44:58,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:00,430 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:02,390 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:04,337 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:06,339 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:08,291 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:10,137 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:11,969 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:13,705 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:15,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:18,709 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:20,275 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:21,910 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:24,794 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:26,127 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:28,787 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:30,032 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:32,488 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:34,676 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:36,749 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:38,592 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:41,261 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:42,810 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:46,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:50,638 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:54,224 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:45:57,741 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:01,279 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:04,773 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:08,218 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:11,599 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:15,102 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:18,571 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:22,005 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:25,397 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.7643, 'learning_rate': 0.00023999999999999998, 'epoch': 0.9}
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:28,870 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:46:32,189 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
18%|█████████████▊ | 404/2230 [1:14:04<5:56:20, 11.71s/it]
{'loss': 4.5921, 'learning_rate': 0.0002406, 'epoch': 0.91}
{'loss': 4.5268, 'learning_rate': 0.00024119999999999998, 'epoch': 0.91}
18%|█████████████▊ | 406/2230 [1:14:30<6:22:31, 12.58s/it]
{'loss': 4.5454, 'learning_rate': 0.0002418, 'epoch': 0.91}
{'loss': 4.5719, 'learning_rate': 0.00024239999999999998, 'epoch': 0.91}
18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]
{'loss': 4.5135, 'learning_rate': 0.000243, 'epoch': 0.91}
{'loss': 4.4509, 'learning_rate': 0.00024359999999999999, 'epoch': 0.92}
{'loss': 4.395, 'learning_rate': 0.00024419999999999997, 'epoch': 0.92}
{'loss': 4.5676, 'learning_rate': 0.0002448, 'epoch': 0.92}
{'loss': 4.3981, 'learning_rate': 0.00024539999999999995, 'epoch': 0.92}
18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.3338, 'learning_rate': 0.00024599999999999996, 'epoch': 0.93} 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 4.4346, 'learning_rate': 0.0002466, 'epoch': 0.93} 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 18%|█████████████▉ | 408/2230 [1:14:57<6:29:54, 12.84s/it]g-point operations will not be computed-22 17:17:06,966 >> `use_cache=True` is incompatible with gradient checkpointing. 
{'loss': 4.4811, 'learning_rate': 0.0002472, 'epoch': 0.93}
{'loss': 4.3513, 'learning_rate': 0.00024779999999999995, 'epoch': 0.93}
{'loss': 4.3441, 'learning_rate': 0.00024839999999999997, 'epoch': 0.93}
19%|██████████████▏ | 418/2230 [1:17:05<6:20:17, 12.59s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:49:46,787 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.3528, 'learning_rate': 0.00024959999999999994, 'epoch': 0.94}
{'loss': 4.2688, 'learning_rate': 0.00025019999999999996, 'epoch': 0.94}
{'loss': 4.3758, 'learning_rate': 0.00025079999999999997, 'epoch': 0.94}
{'loss': 4.2928, 'learning_rate': 0.0002514, 'epoch': 0.95}
{'loss': 4.3609, 'learning_rate': 0.00025199999999999995, 'epoch': 0.95}
[WARNING|modeling_utils.py:388] 2022-03-22 17:50:52,394 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4505, 'learning_rate': 0.00025259999999999996, 'epoch': 0.95}
{'loss': 4.3456, 'learning_rate': 0.0002532, 'epoch': 0.95}
19%|██████████████▌ | 426/2230 [1:18:42<6:04:08, 12.11s/it]
{'loss': 4.407, 'learning_rate': 0.0002538, 'epoch': 0.96}
[WARNING|modeling_utils.py:388] 2022-03-22 17:51:27,276 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.2921, 'learning_rate': 0.00025439999999999995, 'epoch': 0.96}
19%|██████████████▌ | 428/2230 [1:19:04<5:51:10, 11.69s/it]
{'loss': 4.2807, 'learning_rate': 0.00025499999999999996, 'epoch': 0.96}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:51:52,070 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.1697, 'learning_rate': 0.0002556, 'epoch': 0.96}
19%|██████████████▋ | 430/2230 [1:19:27<5:42:26, 11.41s/it]
{'loss': 4.3221, 'learning_rate': 0.0002562, 'epoch': 0.96}
[WARNING|modeling_utils.py:388] 2022-03-22 17:52:11,809 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.2951, 'learning_rate': 0.00025679999999999995, 'epoch': 0.97}
[WARNING|modeling_utils.py:388] 2022-03-22 17:52:22,382 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.4834, 'learning_rate': 0.00025739999999999997, 'epoch': 0.97}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:52:34,792 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.3382, 'learning_rate': 0.000258, 'epoch': 0.97}
[WARNING|modeling_utils.py:388] 2022-03-22 17:52:42,726 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 4.2568, 'learning_rate': 0.0002586, 'epoch': 0.97}
[WARNING|modeling_utils.py:388] 2022-03-22 17:52:48,891 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:52:53,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
20%|██████████████▊ | 435/2230 [1:20:18<5:07:11, 10.27s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 17:52:57,218 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:53:01,419 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 17:53:05,292 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 17:53:09,213 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 17:53:11,312 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 20%|██████████████▉ | 437/2230 [1:20:36<4:46:01, 9.57s/it][WARNING|modeling_bart.py:1051] 2022-03-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 17:53:15,498 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 17:53:21,636 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:53:23,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:25,625 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:27,450 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:29,355 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:31,211 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:53:32,942 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:34,645 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:37,980 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:39,540 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:41,073 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:42,563 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:53:45,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:48,022 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:49,351 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:51,765 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:54,047 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:53:56,138 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:58,149 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:53:59,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:01,694 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:54:03,944 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 3.5722, 'learning_rate': 0.00026579999999999996, 'epoch': 1.0} [WARNING|modeling_utils.py:388] 2022-03-22 17:54:06,395 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:09,980 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:13,575 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:54:17,081 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 5.523, 'learning_rate': 0.00026639999999999997, 'epoch': 1.0} [WARNING|modeling_utils.py:388] 2022-03-22 17:54:20,699 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:24,157 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:27,506 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:54:30,892 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:34,408 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:37,797 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:41,158 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:54:44,583 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:48,167 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:51,501 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 17:54:54,968 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 17:54:58,327 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 4.6654, 'learning_rate': 0.00026819999999999996, 'epoch': 1.01}
{'loss': 4.5437, 'learning_rate': 0.0002688, 'epoch': 1.01}
{'loss': 4.4698, 'learning_rate': 0.0002694, 'epoch': 1.01}
20%|███████████████▍ | 453/2230 [1:23:05<6:23:22, 12.94s/it]
{'loss': 4.3267, 'learning_rate': 0.00027059999999999996, 'epoch': 1.02} 20%|███████████████▌ | 455/2230 [1:23:31<6:27:00, 13.08s/it]
{'loss': 4.2144, 'learning_rate': 0.0002712, 'epoch': 1.02} {'loss': 4.2471, 'learning_rate': 0.0002718, 'epoch': 1.02}
{'loss': 4.0365, 'learning_rate': 0.0002724, 'epoch': 1.02} 21%|███████████████▌ | 458/2230 [1:24:10<6:22:21, 12.95s/it]
{'loss': 3.9685, 'learning_rate': 0.00027299999999999997, 'epoch': 1.03} 21%|███████████████▋ | 459/2230 [1:24:23<6:21:39, 12.93s/it] {'loss': 4.1301, 'learning_rate': 0.0002736, 'epoch': 1.03}
{'loss': 4.063, 'learning_rate': 0.0002742, 'epoch': 1.03}
{'loss': 4.067, 'learning_rate': 0.0002748, 'epoch': 1.03}
{'loss': 3.9844, 'learning_rate': 0.00027539999999999997, 'epoch': 1.04}
{'loss': 4.0705, 'learning_rate': 0.000276, 'epoch': 1.04}
{'loss': 3.9171, 'learning_rate': 0.0002766, 'epoch': 1.04}
{'loss': 3.9141, 'learning_rate': 0.0002772, 'epoch': 1.04}
{'loss': 3.9032, 'learning_rate': 0.0002778, 'epoch': 1.04}
{'loss': 4.0287, 'learning_rate': 0.0002784, 'epoch': 1.05}
{'loss': 3.8513, 'learning_rate': 0.000279, 'epoch': 1.05}
{'loss': 3.8698, 'learning_rate': 0.00027959999999999997, 'epoch': 1.05}
{'loss': 4.0594, 'learning_rate': 0.0002802, 'epoch': 1.05}
[WARNING|modeling_bart.py:1051] 2022-03-22 17:59:21,572 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.9681, 'learning_rate': 0.0002808, 'epoch': 1.06}
21%|████████████████ | 472/2230 [1:27:04<5:50:22, 11.96s/it]
{'loss': 3.9133, 'learning_rate': 0.00028139999999999996, 'epoch': 1.06}
{'loss': 3.8527, 'learning_rate': 0.00028199999999999997, 'epoch': 1.06}
21%|████████████████▏ | 474/2230 [1:27:27<5:44:04, 11.76s/it]
{'loss': 3.7028, 'learning_rate': 0.0002826, 'epoch': 1.06}
[WARNING|modeling_utils.py:388] 2022-03-22 18:00:08,587 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.678, 'learning_rate': 0.00028319999999999994, 'epoch': 1.07}
21%|████████████████▏ | 476/2230 [1:27:52<5:49:56, 11.97s/it]
{'loss': 3.8978, 'learning_rate': 0.00028379999999999996, 'epoch': 1.07}
{'loss': 3.7861, 'learning_rate': 0.0002844, 'epoch': 1.07}
[WARNING|modeling_utils.py:388] 2022-03-22 18:00:45,734 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
21%|████████████████▎ | 478/2230 [1:28:14<5:37:01, 11.54s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:00:54,018 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
21%|████████████████▎ | 479/2230 [1:28:25<5:31:27, 11.36s/it]
{'loss': 3.8174, 'learning_rate': 0.00028559999999999995, 'epoch': 1.07}
[WARNING|modeling_utils.py:388] 2022-03-22 18:01:12,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.8163, 'learning_rate': 0.00028619999999999996, 'epoch': 1.08} [WARNING|modeling_utils.py:388] 2022-03-22 18:01:12,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:12,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:12,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:01:22,794 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:01:22,794 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.9948, 'learning_rate': 0.0002868, 'epoch': 1.08} [WARNING|modeling_bart.py:1051] 2022-03-22 18:01:22,794 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:28,539 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:28,539 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:28,539 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:28,539 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.752, 'learning_rate': 0.00028739999999999994, 'epoch': 1.08} [WARNING|modeling_utils.py:388] 2022-03-22 18:01:28,539 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:38,704 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:38,704 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:01:38,704 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 17:53:13,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 18:01:45,012 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:01:50,994 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.7391, 'learning_rate': 0.00028859999999999997, 'epoch': 1.09}
[WARNING|modeling_utils.py:388] 2022-03-22 18:01:56,990 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:01:59,292 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
 22%|████████████████▌ | 485/2230 [1:29:26<4:51:42, 10.03s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.7673, 'learning_rate': 0.0002892, 'epoch': 1.09}
[WARNING|modeling_utils.py:388] 2022-03-22 18:02:07,347 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:02:09,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:02:11,737 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.6326, 'learning_rate': 0.00028979999999999994, 'epoch': 1.09}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:15,773 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:17,903 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:20,020 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:22,157 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:24,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:28,206 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:30,309 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:32,515 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:34,496 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:36,442 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:38,371 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:40,380 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:42,228 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:43,990 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:45,721 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:47,497 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:49,140 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:52,457 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:53,972 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:56,755 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:02:58,066 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:00,713 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:01,948 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:04,330 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:06,438 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:08,482 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:10,365 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:13,114 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:14,651 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 2.8942, 'learning_rate': 0.0002958, 'epoch': 1.11}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:18,084 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:21,874 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:25,600 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:29,354 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.571, 'learning_rate': 0.0002964, 'epoch': 1.11}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:33,068 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:36,609 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:40,379 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:44,151 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:47,886 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:51,536 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:55,148 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:03:58,739 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.784, 'learning_rate': 0.00029759999999999997, 'epoch': 1.12}
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642
{'loss': 4.4678, 'learning_rate': 0.0002982, 'epoch': 1.12}
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 18:04:15,945 >> Num examples = 2642rue` is incompatible with gradient checkpointing. 
03/22/2022 18:14:18 - INFO - datasets.metric - Removing /home/sanchit_huggingface_co/.cache/huggingface/metrics/wer/default/default_experiment-1-0.arrow
{'eval_loss': 4.352957248687744, 'eval_wer': 1.7716977389924633, 'eval_runtime': 602.6395, 'eval_samples_per_second': 4.384, 'eval_steps_per_second': 0.549, 'epoch': 1.12}
03/22/2022 18:14:37 - WARNING - huggingface_hub.repository - Adding files tracked by Git LFS: ['wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb']. This may take a bit of time if the files are large.
{'loss': 4.2885, 'learning_rate': 0.0002988, 'epoch': 1.12}
{'loss': 4.3183, 'learning_rate': 0.00029939999999999996, 'epoch': 1.13}
{'loss': 4.2899, 'learning_rate': 0.0003, 'epoch': 1.13}
23%|████████████████▉ | 504/2230 [1:43:17<38:14:27, 79.76s/it]
{'loss': 4.0863, 'learning_rate': 0.00029982658959537567, 'epoch': 1.13}
{'loss': 4.1456, 'learning_rate': 0.0002996531791907514, 'epoch': 1.13}
{'loss': 4.2061, 'learning_rate': 0.00029947976878612716, 'epoch': 1.13}
{'loss': 4.041, 'learning_rate': 0.00029930635838150286, 'epoch': 1.14}
23%|█████████████████ | 508/2230 [1:44:12<14:08:12, 29.55s/it]
{'loss': 4.2261, 'learning_rate': 0.0002991329479768786, 'epoch': 1.14}
{'loss': 4.072, 'learning_rate': 0.0002989595375722543, 'epoch': 1.14}
{'loss': 4.0255, 'learning_rate': 0.00029878612716763005, 'epoch': 1.14}
{'loss': 3.9969, 'learning_rate': 0.00029861271676300574, 'epoch': 1.15}
{'loss': 4.0715, 'learning_rate': 0.0002984393063583815, 'epoch': 1.15}
{'loss': 4.0107, 'learning_rate': 0.0002982658959537572, 'epoch': 1.15}
{'loss': 3.8789, 'learning_rate': 0.00029809248554913293, 'epoch': 1.15}
23%|█████████████████▌ | 515/2230 [1:45:44<6:50:21, 14.36s/it]
23%|█████████████████▌ | 516/2230 [1:45:56<6:33:45, 13.78s/it]
{'loss': 4.005, 'learning_rate': 0.00029774566473988437, 'epoch': 1.16}
23%|█████████████████▌ | 517/2230 [1:46:09<6:21:00, 13.35s/it]
{'loss': 3.9039, 'learning_rate': 0.00029757225433526006, 'epoch': 1.16}
23%|█████████████████▋ | 518/2230 [1:46:21<6:10:54, 13.00s/it]
{'loss': 3.9395, 'learning_rate': 0.0002973988439306358, 'epoch': 1.16}
{'loss': 3.9881, 'learning_rate': 0.00029722543352601156, 'epoch': 1.16}
{'loss': 3.9593, 'learning_rate': 0.00029705202312138725, 'epoch': 1.17}
{'loss': 3.8773, 'learning_rate': 0.00029687861271676295, 'epoch': 1.17}
{'loss': 3.8301, 'learning_rate': 0.0002967052023121387, 'epoch': 1.17}
23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it]
23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.8361, 'learning_rate': 0.00029653179190751444, 'epoch': 1.17} 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.9224, 'learning_rate': 0.00029635838150289014, 'epoch': 1.17} 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.8038, 'learning_rate': 0.00029618497109826583, 'epoch': 1.18} 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 23%|█████████████████▊ | 523/2230 [1:47:21<5:42:58, 12.06s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
24%|█████████████████▉ | 526/2230 [1:47:57<5:44:10, 12.12s/it] Setting `use_cache=False`...e computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:20:37,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:02:03,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 3.7414, 'learning_rate': 0.0002958381502890173, 'epoch': 1.18}
24%|█████████████████▉ | 528/2230 [1:48:20<5:31:34, 11.69s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 18:20:57,677 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.8548, 'learning_rate': 0.000295664739884393, 'epoch': 1.18}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:21:10,059 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:21:14,034 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:21:19,938 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:21:23,895 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.7377, 'learning_rate': 0.0002951445086705202, 'epoch': 1.19}
[WARNING|modeling_utils.py:388] 2022-03-22 18:21:34,393 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.809, 'learning_rate': 0.0002949710982658959, 'epoch': 1.19}
24%|██████████████████▏ | 533/2230 [1:49:13<5:01:15, 10.65s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:21:52,295 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:21:56,693 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:00,808 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:06,721 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:09,046 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.6233, 'learning_rate': 0.0002944508670520231, 'epoch': 1.2}
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:14,838 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:17,099 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:22:21,247 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:22:23,458 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:22:25,588 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
24%|██████████████████▎ | 537/2230 [1:49:50<4:29:08, 9.54s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:22:29,834 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:36,017 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:38,098 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:39,975 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:41,839 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:43,684 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:45,585 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:47,360 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:49,054 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:50,712 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:54,070 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:55,609 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:22:57,116 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:00,022 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:01,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:04,140 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:05,398 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:07,863 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:10,185 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:12,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:14,309 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:23:16,119 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:16,119 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:17,936 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:17,936 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:19,527 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:21,834 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:23:21,834 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 6.2992, 'learning_rate': 0.0002923699421965318, 'epoch': 1.23}
{'loss': 5.2557, 'learning_rate': 0.0002921965317919075, 'epoch': 1.23}
{'loss': 4.6328, 'learning_rate': 0.0002920231213872832, 'epoch': 1.23}
{'loss': 4.4602, 'learning_rate': 0.0002918497109826589, 'epoch': 1.23}
{'loss': 4.3019, 'learning_rate': 0.0002916763005780347, 'epoch': 1.24}
{'loss': 4.1883, 'learning_rate': 0.00029150289017341037, 'epoch': 1.24}
25%|██████████████████▊ | 553/2230 [1:52:21<6:04:59, 13.06s/it]
{'loss': 4.2302, 'learning_rate': 0.0002913294797687861, 'epoch': 1.24}
{'loss': 4.0509, 'learning_rate': 0.00029115606936416186, 'epoch': 1.24}
25%|██████████████████▉ | 555/2230 [1:52:48<6:07:51, 13.18s/it]
{'loss': 4.0135, 'learning_rate': 0.00029098265895953756, 'epoch': 1.24}
{'loss': 3.8072, 'learning_rate': 0.00029080924855491325, 'epoch': 1.25}
{'loss': 3.8271, 'learning_rate': 0.000290635838150289, 'epoch': 1.25}
{'loss': 3.8352, 'learning_rate': 0.00029046242774566475, 'epoch': 1.25}
{'loss': 3.7885, 'learning_rate': 0.00029028901734104044, 'epoch': 1.25}
{'loss': 3.8345, 'learning_rate': 0.00029011560693641613, 'epoch': 1.26}
{'loss': 3.7898, 'learning_rate': 0.0002899421965317919, 'epoch': 1.26}
{'loss': 3.6793, 'learning_rate': 0.00028976878612716763, 'epoch': 1.26}
25%|███████████████████▏ | 563/2230 [1:54:32<6:09:23, 13.30s/it]
{'loss': 3.7274, 'learning_rate': 0.0002895953757225433, 'epoch': 1.26}
25%|███████████████████▏ | 563/2230 [1:54:32<6:09:23, 13.30s/it]
{'loss': 3.5994, 'learning_rate': 0.00028942196531791907, 'epoch': 1.26}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:27:26,115 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.5573, 'learning_rate': 0.00028924855491329476, 'epoch': 1.27}
{'loss': 3.5576, 'learning_rate': 0.0002890751445086705, 'epoch': 1.27}
{'loss': 3.577, 'learning_rate': 0.0002889017341040462, 'epoch': 1.27}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:27:26,115 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.6755, 'learning_rate': 0.00028872832369942195, 'epoch': 1.27}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:27:26,115 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.39, 'learning_rate': 0.00028855491329479765, 'epoch': 1.28}
{'loss': 3.5657, 'learning_rate': 0.0002883815028901734, 'epoch': 1.28}
{'loss': 3.4084, 'learning_rate': 0.0002882080924855491, 'epoch': 1.28}
26%|███████████████████▍ | 572/2230 [1:56:23<5:34:46, 12.11s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:04,243 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:04,243 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.4383, 'learning_rate': 0.00028786127167630053, 'epoch': 1.28}
{'loss': 3.3082, 'learning_rate': 0.0002876878612716763, 'epoch': 1.29}
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:04,243 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.299, 'learning_rate': 0.000287514450867052, 'epoch': 1.29}
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:41,159 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.323, 'learning_rate': 0.0002873410404624277, 'epoch': 1.29}
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:51,096 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:51,096 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.3031, 'learning_rate': 0.0002871676300578034, 'epoch': 1.29}
[WARNING|modeling_utils.py:388] 2022-03-22 18:29:51,096 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.1504, 'learning_rate': 0.00028699421965317916, 'epoch': 1.3}
26%|███████████████████▋ | 579/2230 [1:57:45<5:15:13, 11.46s/it]
{'loss': 3.2823, 'learning_rate': 0.0002868208092485549, 'epoch': 1.3}
[WARNING|modeling_utils.py:388] 2022-03-22 18:30:32,432 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.182, 'learning_rate': 0.0002866473988439306, 'epoch': 1.3}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:30:38,611 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.3284, 'learning_rate': 0.00028647398843930635, 'epoch': 1.3}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:30:38,611 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.1441, 'learning_rate': 0.0002863005780346821, 'epoch': 1.3}
[WARNING|modeling_utils.py:388] 2022-03-22 18:30:58,608 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:04,927 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:04,927 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.0156, 'learning_rate': 0.0002861271676300578, 'epoch': 1.31}
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:11,086 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 2.8212, 'learning_rate': 0.0002859537572254335, 'epoch': 1.31}
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:17,229 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:17,229 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:21,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
26%|███████████████████▉ | 585/2230 [1:58:46<4:38:32, 10.16s/it]
{'loss': 2.73, 'learning_rate': 0.00028578034682080923, 'epoch': 1.31}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:27,226 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:29,446 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:31,578 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:33,781 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 2.8376, 'learning_rate': 0.000285606936416185, 'epoch': 1.31}
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:37,284 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:39,349 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:41,424 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:31:43,369 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:47,567 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:49,412 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:51,436 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:53,244 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:54,995 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:56,698 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:31:58,459 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:01,721 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:03,312 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:04,922 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:07,808 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:09,196 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:11,941 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:13,218 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:15,798 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:18,094 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:20,338 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:21,391 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:23,411 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:26,223 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:28,032 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:30,349 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.888, 'learning_rate': 0.0002838728323699422, 'epoch': 1.34}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:33,798 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:37,453 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:41,005 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:44,482 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 6.5668, 'learning_rate': 0.0002836994219653179, 'epoch': 1.34}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:48,021 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:51,565 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:55,131 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:32:58,588 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 5.3153, 'learning_rate': 0.00028352601156069363, 'epoch': 1.34}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:02,098 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:05,537 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:08,977 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:12,409 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 4.1101, 'learning_rate': 0.0002833526011560693, 'epoch': 1.34}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:15,922 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:19,305 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:22,666 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:33:26,077 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.7808, 'learning_rate': 0.00028317919075144507, 'epoch': 1.35}
{'loss': 3.4887, 'learning_rate': 0.00028300578034682076, 'epoch': 1.35}
{'loss': 3.5204, 'learning_rate': 0.0002828323699421965, 'epoch': 1.35}
{'loss': 3.3659, 'learning_rate': 0.00028265895953757226, 'epoch': 1.35}
{'loss': 3.2103, 'learning_rate': 0.00028248554913294795, 'epoch': 1.35}
{'loss': 3.0861, 'learning_rate': 0.00028231213872832365, 'epoch': 1.36}
 27%|████████████████████▋ | 606/2230 [2:02:13<6:00:00, 13.30s/it]
{'loss': 2.8952, 'learning_rate': 0.0002821387283236994, 'epoch': 1.36}
{'loss': 2.9341, 'learning_rate': 0.00028196531791907514, 'epoch': 1.36}
{'loss': 2.7669, 'learning_rate': 0.00028179190751445083, 'epoch': 1.36}
{'loss': 2.687, 'learning_rate': 0.0002816184971098266, 'epoch': 1.37}
 27%|████████████████████▊ | 610/2230 [2:03:04<5:50:36, 12.99s/it]
{'loss': 2.5708, 'learning_rate': 0.0002814450867052023, 'epoch': 1.37}
{'loss': 2.3895, 'learning_rate': 0.000281271676300578, 'epoch': 1.37}
{'loss': 2.38, 'learning_rate': 0.0002810982658959537, 'epoch': 1.37}
{'loss': 2.4185, 'learning_rate': 0.00028092485549132947, 'epoch': 1.37}
28%|████████████████████▉ | 614/2230 [2:03:57<5:52:38, 13.09s/it]
{'loss': 2.3059, 'learning_rate': 0.00028075144508670516, 'epoch': 1.38}
28%|████████████████████▉ | 615/2230 [2:04:09<5:46:39, 12.88s/it]
{'loss': 2.2793, 'learning_rate': 0.0002805780346820809, 'epoch': 1.38}
28%|████████████████████▉ | 616/2230 [2:04:22<5:42:45, 12.74s/it]
{'loss': 2.1804, 'learning_rate': 0.0002804046242774566, 'epoch': 1.38}
28%|█████████████████████ | 617/2230 [2:04:34<5:39:41, 12.64s/it]
{'loss': 2.1364, 'learning_rate': 0.00028023121387283235, 'epoch': 1.38}
28%|█████████████████████ | 618/2230 [2:04:46<5:36:05, 12.51s/it]
{'loss': 2.0321, 'learning_rate': 0.00028005780346820804, 'epoch': 1.39}
28%|█████████████████████ | 619/2230 [2:04:58<5:32:31, 12.38s/it]
{'loss': 2.03, 'learning_rate': 0.0002798843930635838, 'epoch': 1.39}
{'loss': 1.8484, 'learning_rate': 0.00027971098265895954, 'epoch': 1.39}
[WARNING|modeling_utils.py:388] 2022-03-22 18:37:56,323 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 2.004, 'learning_rate': 0.00027953757225433523, 'epoch': 1.39}
{'loss': 1.9594, 'learning_rate': 0.0002793641618497109, 'epoch': 1.39}
[WARNING|modeling_utils.py:388] 2022-03-22 18:38:22,816 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.8258, 'learning_rate': 0.00027919075144508667, 'epoch': 1.4}
{'loss': 1.8189, 'learning_rate': 0.0002790173410404624, 'epoch': 1.4}
{'loss': 1.7799, 'learning_rate': 0.0002788439306358381, 'epoch': 1.4}
28%|█████████████████████▎ | 626/2230 [2:06:22<5:23:09, 12.09s/it]
{'loss': 1.6976, 'learning_rate': 0.00027867052023121386, 'epoch': 1.4}
{'loss': 1.6733, 'learning_rate': 0.00027849710982658955, 'epoch': 1.41}
28%|█████████████████████▍ | 628/2230 [2:06:45<5:12:30, 11.70s/it]
{'loss': 1.703, 'learning_rate': 0.0002783236994219653, 'epoch': 1.41}
{'loss': 1.7001, 'learning_rate': 0.000278150289017341, 'epoch': 1.41}
[WARNING|modeling_utils.py:388] 2022-03-22 18:39:41,090 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:39:45,156 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.608, 'learning_rate': 0.00027797687861271674, 'epoch': 1.41}
[WARNING|modeling_utils.py:388] 2022-03-22 18:39:49,138 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.7678, 'learning_rate': 0.0002778034682080925, 'epoch': 1.41}
[WARNING|modeling_utils.py:388] 2022-03-22 18:39:59,673 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.4661, 'learning_rate': 0.0002776300578034682, 'epoch': 1.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:11,959 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.4454, 'learning_rate': 0.0002774566473988439, 'epoch': 1.42}
[WARNING|modeling_utils.py:388] 2022-03-22 18:40:19,859 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:24,327 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.2762, 'learning_rate': 0.0002772832369942196, 'epoch': 1.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:30,377 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:40:34,331 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.283, 'learning_rate': 0.0002771098265895954, 'epoch': 1.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:38,603 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:40:42,453 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:40:44,770 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:48,858 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:40:52,557 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.5555, 'learning_rate': 0.0002767630057803468, 'epoch': 1.43}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:40:56,439 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:00,252 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:02,231 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:04,303 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:06,222 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:08,061 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:09,910 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:11,760 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:13,496 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:16,832 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:18,509 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:20,095 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:21,640 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:24,721 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:26,194 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:27,625 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:30,384 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:32,930 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:34,190 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:36,596 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:38,767 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:40,870 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:42,671 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:45,252 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:45,970 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:49,219 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:52,827 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:56,331 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:41:59,937 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:42:03,479 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:42:06,996 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:06,996 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:10,463 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:10,463 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:10,463 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:13,918 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:17,485 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 18:42:17,485 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:20,882 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:20,882 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:24,281 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:27,621 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:27,621 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 3.5137, 'learning_rate': 0.00027468208092485546, 'epoch': 1.46} [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:31,049 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 18:42:31,049 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:34,480 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 2.9133, 'learning_rate': 0.00027450867052023116, 'epoch': 1.46} [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 2.6317, 'learning_rate': 0.0002743352601156069, 'epoch': 1.46} [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 2.4746, 'learning_rate': 0.00027416184971098265, 'epoch': 1.46} [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 2.2373, 'learning_rate': 0.00027398843930635835, 'epoch': 1.46} [WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. 
[WARNING|modeling_bart.py:1051] 2022-03-22 18:42:37,831 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
29%|██████████████████████▎ | 654/2230 [2:11:00<5:44:20, 13.11s/it]
{'loss': 2.0943, 'learning_rate': 0.0002738150289017341, 'epoch': 1.47}
{'loss': 2.0145, 'learning_rate': 0.00027364161849710984, 'epoch': 1.47}
29%|██████████████████████▎ | 656/2230 [2:11:27<5:44:59, 13.15s/it]
{'loss': 1.7654, 'learning_rate': 0.00027346820809248554, 'epoch': 1.47}
{'loss': 1.7211, 'learning_rate': 0.00027329479768786123, 'epoch': 1.47}
{'loss': 1.5969, 'learning_rate': 0.000273121387283237, 'epoch': 1.48}
30%|██████████████████████▍ | 659/2230 [2:12:06<5:39:30, 12.97s/it]
{'loss': 1.5845, 'learning_rate': 0.0002729479768786127, 'epoch': 1.48}
30%|██████████████████████▍ | 660/2230 [2:12:18<5:37:11, 12.89s/it]
{'loss': 1.4751, 'learning_rate': 0.0002727745664739884, 'epoch': 1.48}
{'loss': 1.3363, 'learning_rate': 0.0002726011560693641, 'epoch': 1.48}
{'loss': 1.4412, 'learning_rate': 0.00027242774566473986, 'epoch': 1.48}
{'loss': 1.2796, 'learning_rate': 0.0002722543352601156, 'epoch': 1.49}
{'loss': 1.2538, 'learning_rate': 0.0002720809248554913, 'epoch': 1.49}
{'loss': 1.1078, 'learning_rate': 0.00027190751445086705, 'epoch': 1.49}
{'loss': 1.2755, 'learning_rate': 0.00027173410404624274, 'epoch': 1.49}
{'loss': 1.0332, 'learning_rate': 0.0002715606936416185, 'epoch': 1.5}
{'loss': 1.0823, 'learning_rate': 0.0002713872832369942, 'epoch': 1.5}
{'loss': 1.1291, 'learning_rate': 0.00027121387283236993, 'epoch': 1.5}
{'loss': 1.0924, 'learning_rate': 0.0002710404624277456, 'epoch': 1.5}
{'loss': 1.1536, 'learning_rate': 0.0002708670520231214, 'epoch': 1.5}
30%|██████████████████████▉ | 672/2230 [2:14:48<5:13:59, 12.09s/it]
{'loss': 1.1519, 'learning_rate': 0.00027069364161849707, 'epoch': 1.51}
30%|██████████████████████▉ | 673/2230 [2:14:59<5:11:56, 12.02s/it]
{'loss': 1.0834, 'learning_rate': 0.0002705202312138728, 'epoch': 1.51}
{'loss': 1.0253, 'learning_rate': 0.0002703468208092485, 'epoch': 1.51}
{'loss': 1.0866, 'learning_rate': 0.00027017341040462426, 'epoch': 1.51}
30%|███████████████████████ | 676/2230 [2:15:36<5:14:05, 12.13s/it]
{'loss': 0.9893, 'learning_rate': 0.00027, 'epoch': 1.52}
[WARNING|modeling_utils.py:388] 2022-03-22 18:48:25,746 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:48:29,926 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
30%|███████████████████████ | 678/2230 [2:15:58<5:00:57, 11.64s/it]
{'loss': 0.9538, 'learning_rate': 0.0002696531791907514, 'epoch': 1.52}
{'loss': 0.9511, 'learning_rate': 0.00026947976878612714, 'epoch': 1.52}
{'loss': 0.988, 'learning_rate': 0.0002693063583815029, 'epoch': 1.52}
[WARNING|modeling_utils.py:388] 2022-03-22 18:48:59,849 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
31%|███████████████████████▏ | 681/2230 [2:16:31<4:44:18, 11.01s/it]
{'loss': 0.903, 'learning_rate': 0.0002691329479768786, 'epoch': 1.53}
[WARNING|modeling_utils.py:388] 2022-03-22 18:49:14,214 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
31%|███████████████████████▏ | 682/2230 [2:16:41<4:39:15, 10.82s/it]
{'loss': 0.9232, 'learning_rate': 0.00026895953757225433, 'epoch': 1.53}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:49:22,710 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:49:22,710 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▎ | 683/2230 [2:16:51<4:33:41, 10.62s/it] Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▎ | 683/2230 [2:16:51<4:33:41, 10.62s/it] Setting `use_cache=False`...e computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:30,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:30,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:30,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:30,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▎ | 684/2230 [2:17:01<4:27:56, 10.40s/it]g-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 31%|███████████████████████▎ | 684/2230 [2:17:01<4:27:56, 10.40s/it]g-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:40,532 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:42,915 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:42,915 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:42,915 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:42,915 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 18:49:48,807 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 18:22:27,743 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 18:49:51,083 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:49:55,180 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
31%|███████████████████████▍ | 686/2230 [2:17:20<4:12:51, 9.83s/it][WARNING|modeling_bart.py:1051] 2022-03-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:49:59,558 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:03,157 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:05,216 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:07,276 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:12,807 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:14,659 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:16,663 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:18,477 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:20,247 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:22,005 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:25,492 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:27,127 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:28,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:30,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:33,381 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:34,872 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:37,662 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:38,942 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:41,495 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:43,897 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:45,020 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:48,166 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:50,099 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:51,916 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:53,782 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:56,050 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:50:59,837 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:03,459 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:06,969 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:10,469 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:14,018 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:17,572 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:21,031 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:24,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 3.028, 'learning_rate': 0.00026618497109826586, 'epoch': 1.57}
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:28,148 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:31,654 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:35,123 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:38,575 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:42,047 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:45,384 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:51:48,789 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.8521, 'learning_rate': 0.0002658381502890173, 'epoch': 1.57}
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]
{'loss': 2.0182, 'learning_rate': 0.00026566473988439305, 'epoch': 1.57}
{'loss': 1.5691, 'learning_rate': 0.00026549132947976874, 'epoch': 1.57}
{'loss': 1.5199, 'learning_rate': 0.0002653179190751445, 'epoch': 1.58}
{'loss': 1.4381, 'learning_rate': 0.00026514450867052024, 'epoch': 1.58}
{'loss': 1.2407, 'learning_rate': 0.00026497109826589593, 'epoch': 1.58}
{'loss': 1.288, 'learning_rate': 0.0002647976878612716, 'epoch': 1.58}
{'loss': 1.1379, 'learning_rate': 0.00026462427745664737, 'epoch': 1.59}
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 1.1389, 'learning_rate': 0.0002644508670520231, 'epoch': 1.59} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 1.1698, 'learning_rate': 0.0002642774566473988, 'epoch': 1.59} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.9825, 'learning_rate': 0.00026410404624277456, 'epoch': 1.59} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.9905, 'learning_rate': 0.00026393063583815025, 'epoch': 1.59} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.8542, 'learning_rate': 0.000263757225433526, 'epoch': 1.6} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.9222, 'learning_rate': 0.0002635838150289017, 'epoch': 1.6} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.8739, 'learning_rate': 0.00026341040462427744, 'epoch': 1.6} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.8896, 'learning_rate': 0.00026323699421965314, 'epoch': 1.6} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.827, 'learning_rate': 0.0002630635838150289, 'epoch': 1.61} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.8345, 'learning_rate': 0.0002628901734104046, 'epoch': 1.61} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.8356, 'learning_rate': 0.0002627167630057803, 'epoch': 1.61} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.787, 'learning_rate': 0.000262543352601156, 'epoch': 1.61} 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 31%|███████████████████████▉ | 701/2230 [2:19:31<5:22:30, 12.66s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 18:56:09,307 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.7632, 'learning_rate': 0.00026236994219653177, 'epoch': 1.61}
{'loss': 0.7849, 'learning_rate': 0.0002621965317919075, 'epoch': 1.62}
{'loss': 0.9055, 'learning_rate': 0.0002620231213872832, 'epoch': 1.62}
32%|████████████████████████▋ | 723/2230 [2:24:11<5:01:05, 11.99s/it]
{'loss': 0.88, 'learning_rate': 0.0002618497109826589, 'epoch': 1.62}
{'loss': 0.6907, 'learning_rate': 0.00026167630057803465, 'epoch': 1.62}
{'loss': 0.7317, 'learning_rate': 0.0002615028901734104, 'epoch': 1.63}
[WARNING|modeling_utils.py:388] 2022-03-22 18:57:18,986 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]
33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.7804, 'learning_rate': 0.0002613294797687861, 'epoch': 1.63} 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.6707, 'learning_rate': 0.00026115606936416184, 'epoch': 1.63} 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▋ | 726/2230 [2:24:48<5:04:59, 12.17s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.6019, 'learning_rate': 0.0002609826589595376, 'epoch': 1.63} 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 33%|████████████████████████▊ | 728/2230 [2:25:10<4:53:26, 11.72s/it]g-point operations will not be computed-22 18:49:57,435 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.6316, 'learning_rate': 0.0002608092485549133, 'epoch': 1.63}
{'loss': 0.7106, 'learning_rate': 0.000260635838150289, 'epoch': 1.64}
[WARNING|modeling_utils.py:388] 2022-03-22 18:58:11,631 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
33%|████████████████████████▉ | 731/2230 [2:25:42<4:36:04, 11.05s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:58:22,126 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
33%|████████████████████████▉ | 732/2230 [2:25:53<4:30:24, 10.83s/it]
{'loss': 0.7568, 'learning_rate': 0.00026028901734104047, 'epoch': 1.64}
[WARNING|modeling_bart.py:1051] 2022-03-22 18:58:34,473 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
33%|████████████████████████▉ | 733/2230 [2:26:03<4:24:42, 10.61s/it]
{'loss': 0.6379, 'learning_rate': 0.00026011560693641616, 'epoch': 1.64}
[WARNING|modeling_utils.py:388] 2022-03-22 18:58:46,167 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
33%|█████████████████████████ | 734/2230 [2:26:13<4:19:25, 10.41s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:58:52,276 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:58:56,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 18:59:00,653 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:59:02,950 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:07,076 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
33%|█████████████████████████ | 736/2230 [2:26:32<4:05:42, 9.87s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 18:59:10,877 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 18:59:13,045 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:16,850 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:19,031 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:21,128 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:26,805 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:28,910 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:30,828 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:32,728 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:34,553 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:36,388 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:38,132 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:41,485 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:43,190 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:44,815 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:46,358 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:49,406 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:50,806 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:53,520 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:54,867 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:57,272 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 18:59:59,659 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:01,816 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:03,827 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:05,620 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:07,409 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:09,687 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.0277, 'learning_rate': 0.00025786127167630056, 'epoch': 1.67}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:12,928 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:16,479 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:20,059 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:23,631 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:27,231 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:30,726 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:34,193 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:37,704 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:41,285 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:44,735 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:48,191 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:51,605 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:55,125 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:00:58,471 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:01:01,824 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:01:05,186 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.2491, 'learning_rate': 0.00025716763005780344, 'epoch': 1.68}
34%|█████████████████████████▌ | 751/2230 [2:28:44<5:11:18, 12.63s/it]
{'loss': 1.2336, 'learning_rate': 0.00025699421965317914, 'epoch': 1.68}
{'loss': 1.1964, 'learning_rate': 0.0002568208092485549, 'epoch': 1.69}
34%|█████████████████████████▋ | 753/2230 [2:29:11<5:20:34, 13.02s/it]
{'loss': 1.1691, 'learning_rate': 0.00025664739884393063, 'epoch': 1.69}
{'loss': 0.959, 'learning_rate': 0.0002564739884393063, 'epoch': 1.69}
{'loss': 0.9412, 'learning_rate': 0.00025630057803468207, 'epoch': 1.69}
{'loss': 0.9436, 'learning_rate': 0.0002561271676300578, 'epoch': 1.7}
{'loss': 0.7702, 'learning_rate': 0.0002559537572254335, 'epoch': 1.7}
{'loss': 0.7077, 'learning_rate': 0.0002557803468208092, 'epoch': 1.7}
{'loss': 0.6719, 'learning_rate': 0.00025560693641618496, 'epoch': 1.7}
{'loss': 0.6603, 'learning_rate': 0.0002554335260115607, 'epoch': 1.7}
{'loss': 0.7441, 'learning_rate': 0.0002552601156069364, 'epoch': 1.71}
{'loss': 0.6565, 'learning_rate': 0.0002550867052023121, 'epoch': 1.71}
34%|██████████████████████████ | 763/2230 [2:31:22<5:25:49, 13.33s/it]
{'loss': 0.6591, 'learning_rate': 0.00025491329479768784, 'epoch': 1.71}
{'loss': 0.6369, 'learning_rate': 0.0002547398843930636, 'epoch': 1.71}
{'loss': 0.6335, 'learning_rate': 0.0002545664739884393, 'epoch': 1.72}
{'loss': 0.6988, 'learning_rate': 0.00025439306358381503, 'epoch': 1.72}
{'loss': 0.7381, 'learning_rate': 0.0002542196531791907, 'epoch': 1.72}
{'loss': 0.7094, 'learning_rate': 0.00025404624277456647, 'epoch': 1.72}
{'loss': 0.6374, 'learning_rate': 0.00025387283236994216, 'epoch': 1.72}
{'loss': 0.6439, 'learning_rate': 0.0002536994219653179, 'epoch': 1.73}
35%|██████████████████████████▎ | 771/2230 [2:33:00<4:57:19, 12.23s/it]
{'loss': 0.5836, 'learning_rate': 0.0002535260115606936, 'epoch': 1.73}
35%|██████████████████████████▎ | 772/2230 [2:33:12<4:54:50, 12.13s/it]
{'loss': 0.6539, 'learning_rate': 0.00025335260115606935, 'epoch': 1.73}
{'loss': 0.5807, 'learning_rate': 0.00025317919075144504, 'epoch': 1.73}
{'loss': 0.6986, 'learning_rate': 0.0002530057803468208, 'epoch': 1.74}
{'loss': 0.6707, 'learning_rate': 0.0002528323699421965, 'epoch': 1.74}
35%|██████████████████████████▍ | 776/2230 [2:34:01<4:56:39, 12.24s/it] [WARNING|modeling_utils.py:388] 2022-03-22 19:06:47,120 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.6065, 'learning_rate': 0.000252485549132948, 'epoch': 1.74} 35%|██████████████████████████▌ | 778/2230 [2:34:24<4:44:43, 11.77s/it] {'loss': 0.591, 'learning_rate': 0.0002523121387283237, 'epoch': 1.74}
{'loss': 0.6904, 'learning_rate': 0.00025213872832369937, 'epoch': 1.75} [WARNING|modeling_utils.py:388] 2022-03-22 19:07:21,542 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.5763, 'learning_rate': 0.0002519653179190751, 'epoch': 1.75} 35%|██████████████████████████▌ | 781/2230 [2:34:57<4:28:03, 11.10s/it]
{'loss': 0.6093, 'learning_rate': 0.00025179190751445086, 'epoch': 1.75} [WARNING|modeling_bart.py:1051] 2022-03-22 19:07:42,146 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 35%|██████████████████████████▋ | 782/2230 [2:35:07<4:22:23, 10.87s/it] {'loss': 0.6226, 'learning_rate': 0.00025161849710982656, 'epoch': 1.75} [WARNING|modeling_utils.py:388] 2022-03-22 19:07:47,796 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:07:54,100 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.6242, 'learning_rate': 0.0002514450867052023, 'epoch': 1.76} [WARNING|modeling_utils.py:388] 2022-03-22 19:08:00,261 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 35%|██████████████████████████▋ | 784/2230 [2:35:27<4:10:50, 10.41s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:08:06,300 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 19:08:10,520 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:08:14,562 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:08:18,766 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:08:22,504 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:24,768 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:26,956 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:08:29,090 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:31,213 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.5859, 'learning_rate': 0.0002507514450867052, 'epoch': 1.76} [WARNING|modeling_bart.py:1051] 2022-03-22 19:08:34,989 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:08:38,987 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:40,908 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:08:42,924 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:44,766 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:46,596 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:48,472 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:50,364 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:08:52,080 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:55,428 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:57,090 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:08:58,695 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:01,730 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:09:03,288 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:04,669 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:07,333 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:09,909 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:11,083 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:09:13,428 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:15,531 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:17,593 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:19,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:21,277 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:09:23,664 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.5475, 'learning_rate': 0.00024919075144508665, 'epoch': 1.78} [WARNING|modeling_utils.py:388] 2022-03-22 19:09:27,474 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:31,080 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:34,669 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:09:38,202 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 2.8929, 'learning_rate': 0.0002490173410404624, 'epoch': 1.79} [WARNING|modeling_utils.py:388] 2022-03-22 19:09:41,858 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:45,344 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:48,837 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:09:52,306 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 2.074, 'learning_rate': 0.00024884393063583814, 'epoch': 1.79} [WARNING|modeling_utils.py:388] 2022-03-22 19:09:55,802 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:09:59,212 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:10:02,675 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:10:06,058 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:10:09,529 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:10:12,957 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:10:16,362 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.0365, 'learning_rate': 0.0002484971098265896, 'epoch': 1.79}
36%|███████████████████████████▎ | 801/2230 [2:37:58<5:02:07, 12.69s/it]
{'loss': 1.07, 'learning_rate': 0.00024832369942196533, 'epoch': 1.8}
{'loss': 1.0102, 'learning_rate': 0.000248150289017341, 'epoch': 1.8}
{'loss': 0.8969, 'learning_rate': 0.0002479768786127167, 'epoch': 1.8}
{'loss': 0.9204, 'learning_rate': 0.00024780346820809247, 'epoch': 1.8}
{'loss': 0.8675, 'learning_rate': 0.0002476300578034682, 'epoch': 1.8}
{'loss': 0.8545, 'learning_rate': 0.0002474566473988439, 'epoch': 1.81}
36%|███████████████████████████▌ | 807/2230 [2:39:18<5:12:29, 13.18s/it]
{'loss': 0.635, 'learning_rate': 0.0002472832369942196, 'epoch': 1.81}
{'loss': 0.5683, 'learning_rate': 0.00024710982658959535, 'epoch': 1.81}
{'loss': 0.6124, 'learning_rate': 0.0002469364161849711, 'epoch': 1.81}
{'loss': 0.6162, 'learning_rate': 0.0002467630057803468, 'epoch': 1.82}
{'loss': 0.5883, 'learning_rate': 0.00024658959537572254, 'epoch': 1.82}
{'loss': 0.6141, 'learning_rate': 0.00024641618497109823, 'epoch': 1.82}
{'loss': 0.538, 'learning_rate': 0.000246242774566474, 'epoch': 1.82}
{'loss': 0.5417, 'learning_rate': 0.0002460693641618497, 'epoch': 1.83}
{'loss': 0.6722, 'learning_rate': 0.0002458959537572254, 'epoch': 1.83}
{'loss': 0.5552, 'learning_rate': 0.0002457225433526011, 'epoch': 1.83}
37%|███████████████████████████▊ | 817/2230 [2:41:27<4:59:13, 12.71s/it]
{'loss': 0.6006, 'learning_rate': 0.00024554913294797686, 'epoch': 1.83}
[WARNING|modeling_utils.py:388] 2022-03-22 19:14:15,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.5318, 'learning_rate': 0.00024537572254335256, 'epoch': 1.83}
37%|███████████████████████████▉ | 819/2230 [2:41:52<4:53:27, 12.48s/it]
{'loss': 0.5067, 'learning_rate': 0.0002452023121387283, 'epoch': 1.84}
37%|███████████████████████████▉ | 820/2230 [2:42:04<4:50:59, 12.38s/it]
{'loss': 0.4765, 'learning_rate': 0.000245028901734104, 'epoch': 1.84}
{'loss': 0.5454, 'learning_rate': 0.00024485549132947975, 'epoch': 1.84}
{'loss': 0.6913, 'learning_rate': 0.0002446820809248555, 'epoch': 1.84}
{'loss': 0.481, 'learning_rate': 0.0002445086705202312, 'epoch': 1.85}
37%|████████████████████████████ | 824/2230 [2:42:51<4:39:29, 11.93s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 19:15:39,456 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5372, 'learning_rate': 0.00024416184971098263, 'epoch': 1.85}
37%|████████████████████████████▏ | 826/2230 [2:43:16<4:43:46, 12.13s/it]
{'loss': 0.4706, 'learning_rate': 0.00024398843930635838, 'epoch': 1.85}
{'loss': 0.6338, 'learning_rate': 0.00024381502890173407, 'epoch': 1.85}
37%|████████████████████████████▏ | 828/2230 [2:43:39<4:33:42, 11.71s/it]
{'loss': 0.4947, 'learning_rate': 0.0002436416184971098, 'epoch': 1.86}
{'loss': 0.4791, 'learning_rate': 0.00024346820809248554, 'epoch': 1.86}
[WARNING|modeling_utils.py:388] 2022-03-22 19:16:36,625 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:16:40,689 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
37%|████████████████████████████▎ | 831/2230 [2:44:12<4:18:31, 11.09s/it]
{'loss': 0.524, 'learning_rate': 0.00024312138728323698, 'epoch': 1.86}
[WARNING|modeling_utils.py:388] 2022-03-22 19:16:58,962 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.5472, 'learning_rate': 0.00024294797687861267, 'epoch': 1.87}
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:02,924 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:09,243 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.5067, 'learning_rate': 0.00024277456647398842, 'epoch': 1.87}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:17:17,579 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
37%|████████████████████████████▍ | 834/2230 [2:44:42<4:04:26, 10.51s/it]
{'loss': 0.5282, 'learning_rate': 0.00024260115606936414, 'epoch': 1.87}
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:25,473 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
37%|████████████████████████████▍ | 835/2230 [2:44:52<3:58:58, 10.28s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:31,403 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:33,722 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:17:37,892 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5299, 'learning_rate': 0.00024225433526011558, 'epoch': 1.87}
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:41,746 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:17:45,679 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
38%|████████████████████████████▌ | 837/2230 [2:45:10<3:43:33, 9.63s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:17:49,896 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:55,882 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:57,937 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:17:59,799 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:01,683 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:03,502 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:05,406 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:07,239 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:08,958 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:12,345 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:13,946 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:15,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:18,643 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:20,101 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:21,518 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:24,253 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:26,735 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:27,905 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:30,212 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:32,269 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:34,254 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:36,978 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:38,596 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:39,356 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:41,672 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:45,255 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:48,826 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:52,414 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.5202, 'learning_rate': 0.00024034682080924854, 'epoch': 1.9}
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:55,960 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:18:59,407 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:02,855 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:06,365 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 1.1851, 'learning_rate': 0.00024017341040462423, 'epoch': 1.9}
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:09,883 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:13,324 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:16,785 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:20,223 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:23,700 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:27,153 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:30,517 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:19:33,890 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.7796, 'learning_rate': 0.0002398265895953757, 'epoch': 1.91}
{'loss': 0.8909, 'learning_rate': 0.00023965317919075142, 'epoch': 1.91}
38%|█████████████████████████████ | 852/2230 [2:47:28<4:56:17, 12.90s/it]
{'loss': 0.7804, 'learning_rate': 0.00023947976878612714, 'epoch': 1.91}
{'loss': 0.6437, 'learning_rate': 0.0002393063583815029, 'epoch': 1.91}
{'loss': 0.6786, 'learning_rate': 0.0002391329479768786, 'epoch': 1.91}
{'loss': 0.6218, 'learning_rate': 0.0002389595375722543, 'epoch': 1.92}
38%|█████████████████████████████▏ | 856/2230 [2:48:21<5:02:32, 13.21s/it]
{'loss': 0.6243, 'learning_rate': 0.00023878612716763002, 'epoch': 1.92}
{'loss': 0.6316, 'learning_rate': 0.00023861271676300577, 'epoch': 1.92}
{'loss': 0.646, 'learning_rate': 0.0002384393063583815, 'epoch': 1.92}
{'loss': 0.4941, 'learning_rate': 0.0002382658959537572, 'epoch': 1.93}
{'loss': 0.6629, 'learning_rate': 0.0002380924855491329, 'epoch': 1.93}
{'loss': 0.5397, 'learning_rate': 0.00023791907514450865, 'epoch': 1.93}
{'loss': 0.4732, 'learning_rate': 0.00023774566473988437, 'epoch': 1.93}
{'loss': 0.4691, 'learning_rate': 0.0002375722543352601, 'epoch': 1.93}
{'loss': 0.5437, 'learning_rate': 0.00023739884393063582, 'epoch': 1.94}
{'loss': 0.4784, 'learning_rate': 0.00023722543352601156, 'epoch': 1.94}
{'loss': 0.4133, 'learning_rate': 0.00023705202312138726, 'epoch': 1.94}
39%|█████████████████████████████▌ | 867/2230 [2:50:42<4:44:44, 12.53s/it]
{'loss': 0.6664, 'learning_rate': 0.00023687861271676298, 'epoch': 1.94}
{'loss': 0.5039, 'learning_rate': 0.0002367052023121387, 'epoch': 1.95}
{'loss': 0.6268, 'learning_rate': 0.00023653179190751445, 'epoch': 1.95}
{'loss': 0.4324, 'learning_rate': 0.00023635838150289017, 'epoch': 1.95}
39%|█████████████████████████████▋ | 871/2230 [2:51:29<4:31:40, 11.99s/it]
{'loss': 0.5077, 'learning_rate': 0.00023618497109826586, 'epoch': 1.95}
{'loss': 0.4605, 'learning_rate': 0.00023601156069364158, 'epoch': 1.96}
{'loss': 0.4617, 'learning_rate': 0.00023583815028901733, 'epoch': 1.96}
39%|█████████████████████████████▊ | 874/2230 [2:52:04<4:22:36, 11.62s/it]
{'loss': 0.5816, 'learning_rate': 0.00023566473988439305, 'epoch': 1.96}
{'loss': 0.5213, 'learning_rate': 0.00023549132947976877, 'epoch': 1.96}
{'loss': 0.4203, 'learning_rate': 0.0002353179190751445, 'epoch': 1.96}
[WARNING|modeling_utils.py:388] 2022-03-22 19:25:11,716 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
39%|█████████████████████████████▉ | 877/2230 [2:52:39<4:19:50, 11.52s/it]
{'loss': 0.4604, 'learning_rate': 0.00023514450867052024, 'epoch': 1.97}
{'loss': 0.4604, 'learning_rate': 0.00023497109826589593, 'epoch': 1.97}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:25:32,388 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5035, 'learning_rate': 0.00023479768786127165, 'epoch': 1.97}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:25:40,254 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:25:48,367 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:25:54,517 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.4899, 'learning_rate': 0.00023445086705202312, 'epoch': 1.98}
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:00,627 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:02,978 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
40%|██████████████████████████████ | 882/2230 [2:53:30<3:50:02, 10.24s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:08,891 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:11,170 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:26:15,218 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.4574, 'learning_rate': 0.00023410404624277454, 'epoch': 1.98}
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:19,009 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:21,088 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:23,160 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:25,274 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:27,289 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:29,267 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:31,184 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:33,132 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:34,983 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:36,803 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:38,559 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:40,349 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:43,578 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:45,118 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:46,761 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:48,256 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:51,205 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:52,554 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:55,279 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:26:57,745 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:00,073 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:01,150 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:03,239 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:06,006 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:07,795 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:09,369 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:11,849 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:15,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:19,029 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:19,029 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:22,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:22,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:26,190 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:26,190 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:29,659 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:29,659 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:33,139 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:33,139 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:36,620 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:36,620 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:40,172 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:40,172 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:43,590 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 19:27:43,590 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:47,055 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:50,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:50,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:50,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:53,966 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:27:53,966 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.5724, 'learning_rate': 0.00023184971098265893, 'epoch': 2.01}
{'loss': 0.5684, 'learning_rate': 0.00023167630057803465, 'epoch': 2.01}
{'loss': 0.5024, 'learning_rate': 0.0002315028901734104, 'epoch': 2.01}
40%|██████████████████████████████▋ | 899/2230 [2:56:09<4:45:36, 12.87s/it]
{'loss': 0.4798, 'learning_rate': 0.00023132947976878612, 'epoch': 2.02}
{'loss': 0.4793, 'learning_rate': 0.00023115606936416181, 'epoch': 2.02}
40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]
{'loss': 0.3836, 'learning_rate': 0.00023098265895953754, 'epoch': 2.02}
{'loss': 0.4929, 'learning_rate': 0.00023080924855491328, 'epoch': 2.02}
{'loss': 0.4115, 'learning_rate': 0.000230635838150289, 'epoch': 2.02}
{'loss': 0.4635, 'learning_rate': 0.00023046242774566472, 'epoch': 2.03}
{'loss': 0.3872, 'learning_rate': 0.00023028901734104042, 'epoch': 2.03}
{'loss': 0.3065, 'learning_rate': 0.00023011560693641617, 'epoch': 2.03}
{'loss': 0.3686, 'learning_rate': 0.0002299421965317919, 'epoch': 2.03}
{'loss': 0.377, 'learning_rate': 0.0002297687861271676, 'epoch': 2.04}
{'loss': 0.3566, 'learning_rate': 0.00022959537572254333, 'epoch': 2.04}
{'loss': 0.3976, 'learning_rate': 0.00022942196531791908, 'epoch': 2.04}
{'loss': 0.4026, 'learning_rate': 0.00022924855491329477, 'epoch': 2.04}
{'loss': 0.3069, 'learning_rate': 0.0002290751445086705, 'epoch': 2.04}
{'loss': 0.3355, 'learning_rate': 0.0002289017341040462, 'epoch': 2.05}
{'loss': 0.2953, 'learning_rate': 0.00022872832369942196, 'epoch': 2.05}
40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.4171, 'learning_rate': 0.00022855491329479768, 'epoch': 2.05} 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3288, 'learning_rate': 0.00022838150289017337, 'epoch': 2.05} 40%|██████████████████████████████▋ | 901/2230 [2:56:38<4:58:09, 13.46s/it]g-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
41%|███████████████████████████████▎ | 917/2230 [3:00:00<4:29:09, 12.30s/it]
{'loss': 0.3462, 'learning_rate': 0.0002282080924855491, 'epoch': 2.06}
41%|███████████████████████████████▎ | 918/2230 [3:00:12<4:25:55, 12.16s/it]
{'loss': 0.3657, 'learning_rate': 0.00022803468208092484, 'epoch': 2.06}
[WARNING|modeling_utils.py:388] 2022-03-22 19:32:53,620 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3701, 'learning_rate': 0.00022786127167630056, 'epoch': 2.06}
{'loss': 0.3076, 'learning_rate': 0.00022768786127167628, 'epoch': 2.06}
41%|███████████████████████████████▍ | 921/2230 [3:00:47<4:16:32, 11.76s/it]
{'loss': 0.3014, 'learning_rate': 0.00022751445086705198, 'epoch': 2.07}
{'loss': 0.3676, 'learning_rate': 0.00022734104046242772, 'epoch': 2.07}
41%|███████████████████████████████▍ | 923/2230 [3:01:10<4:11:03, 11.53s/it]
{'loss': 0.322, 'learning_rate': 0.00022716763005780344, 'epoch': 2.07}
{'loss': 0.3089, 'learning_rate': 0.00022699421965317917, 'epoch': 2.07}
{'loss': 0.3649, 'learning_rate': 0.00022682080924855489, 'epoch': 2.07}
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:17,191 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
42%|███████████████████████████████▌ | 926/2230 [3:01:44<4:09:19, 11.47s/it]
{'loss': 0.2995, 'learning_rate': 0.00022664739884393063, 'epoch': 2.08}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:34:29,804 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3248, 'learning_rate': 0.00022647398843930635, 'epoch': 2.08}
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:35,519 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:41,869 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3239, 'learning_rate': 0.00022630057803468205, 'epoch': 2.08}
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:45,738 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:51,885 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3388, 'learning_rate': 0.00022612716763005777, 'epoch': 2.08}
[WARNING|modeling_utils.py:388] 2022-03-22 19:34:58,079 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
42%|███████████████████████████████▋ | 930/2230 [3:02:25<3:44:44, 10.37s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:35:04,084 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:08,346 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:35:12,324 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:16,423 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:18,671 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
42%|███████████████████████████████▊ | 932/2230 [3:02:43<3:32:00, 9.80s/it]
{'loss': 0.3731, 'learning_rate': 0.00022560693641618496, 'epoch': 2.09}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:24,230 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:26,336 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:28,477 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:30,655 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:32,658 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:34,708 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:36,663 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:38,718 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:40,629 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:42,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:44,331 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:46,249 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:48,032 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:49,721 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:53,095 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:54,730 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:56,288 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:35:59,341 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:00,760 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:05,140 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:06,571 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:09,046 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:11,481 >> `use_cache=True` is incompatible with gradient checkpointing.
Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:11,481 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:12,588 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:14,706 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:14,706 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:17,651 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:19,500 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:19,500 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:21,126 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:21,126 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:21,864 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:25,313 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:25,313 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:28,941 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:28,941 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:32,512 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:32,512 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:36,014 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:36,014 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:39,664 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:39,664 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:43,130 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:43,130 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:46,608 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:50,067 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:50,067 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:50,067 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:53,594 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 19:36:53,594 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:36:57,082 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:37:00,545 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:37:03,991 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.663, 'learning_rate': 0.00022335260115606933, 'epoch': 2.12}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:37:07,515 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5875, 'learning_rate': 0.00022317919075144507, 'epoch': 2.12}
{'loss': 0.5918, 'learning_rate': 0.0002230057803468208, 'epoch': 2.12}
{'loss': 0.6611, 'learning_rate': 0.00022283236994219652, 'epoch': 2.13}
{'loss': 0.521, 'learning_rate': 0.00022265895953757224, 'epoch': 2.13}
{'loss': 0.4735, 'learning_rate': 0.00022248554913294798, 'epoch': 2.13}
{'loss': 0.4599, 'learning_rate': 0.00022231213872832368, 'epoch': 2.13}
{'loss': 0.4246, 'learning_rate': 0.0002221387283236994, 'epoch': 2.13}
43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.4222, 'learning_rate': 0.00022196531791907512, 'epoch': 2.14} 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.4073, 'learning_rate': 0.00022179190751445087, 'epoch': 2.14} 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2894, 'learning_rate': 0.0002216184971098266, 'epoch': 2.14} 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3121, 'learning_rate': 0.00022144508670520228, 'epoch': 2.14} 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 43%|████████████████████████████████▍ | 953/2230 [3:06:16<4:39:30, 13.13s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
43%|████████████████████████████████▌ | 957/2230 [3:07:07<4:32:01, 12.82s/it] {'loss': 0.4046, 'learning_rate': 0.000221271676300578, 'epoch': 2.15} 43%|████████████████████████████████▌ | 957/2230 [3:07:07<4:32:01, 12.82s/it] 43%|████████████████████████████████▋ | 958/2230 [3:07:19<4:29:42, 12.72s/it]
{'loss': 0.3066, 'learning_rate': 0.00022109826589595375, 'epoch': 2.15} 43%|████████████████████████████████▋ | 958/2230 [3:07:19<4:29:42, 12.72s/it] 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it] {'loss': 0.2934, 'learning_rate': 0.00022092485549132947, 'epoch': 2.15} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it]
{'loss': 0.3144, 'learning_rate': 0.0002207514450867052, 'epoch': 2.15} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it]
{'loss': 0.3406, 'learning_rate': 0.00022057803468208088, 'epoch': 2.15} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it]
{'loss': 0.2953, 'learning_rate': 0.00022040462427745663, 'epoch': 2.16} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it] {'loss': 0.311, 'learning_rate': 0.00022023121387283235, 'epoch': 2.16} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it]
{'loss': 0.3591, 'learning_rate': 0.00022005780346820807, 'epoch': 2.16} 43%|████████████████████████████████▋ | 959/2230 [3:07:32<4:28:12, 12.66s/it] [WARNING|modeling_bart.py:1051] 2022-03-22 19:41:20,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
43%|████████████████████████████████▉ | 965/2230 [3:08:47<4:23:32, 12.50s/it]
{'loss': 0.3229, 'learning_rate': 0.00021971098265895954, 'epoch': 2.17} 43%|████████████████████████████████▉ | 965/2230 [3:08:47<4:23:32, 12.50s/it] {'loss': 0.2787, 'learning_rate': 0.00021953757225433524, 'epoch': 2.17} 43%|████████████████████████████████▉ | 965/2230 [3:08:47<4:23:32, 12.50s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:41:59,460 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.3134, 'learning_rate': 0.00021936416184971096, 'epoch': 2.17} [WARNING|modeling_utils.py:388] 2022-03-22 19:41:59,460 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 19:42:07,725 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3083, 'learning_rate': 0.00021919075144508668, 'epoch': 2.17} [WARNING|modeling_bart.py:1051] 2022-03-22 19:42:07,725 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3141, 'learning_rate': 0.00021901734104046243, 'epoch': 2.17} [WARNING|modeling_bart.py:1051] 2022-03-22 19:42:07,725 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 19:42:33,999 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3645, 'learning_rate': 0.00021884393063583815, 'epoch': 2.18} [WARNING|modeling_utils.py:388] 2022-03-22 19:42:33,999 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3452, 'learning_rate': 0.00021867052023121384, 'epoch': 2.18} [WARNING|modeling_utils.py:388] 2022-03-22 19:42:33,999 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 44%|█████████████████████████████████▏ | 973/2230 [3:10:19<3:58:38, 11.39s/it] {'loss': 0.3692, 'learning_rate': 0.00021849710982658956, 'epoch': 2.18} 44%|█████████████████████████████████▏ | 973/2230 [3:10:19<3:58:38, 11.39s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 19:43:07,112 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:43:11,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2614, 'learning_rate': 0.00021815028901734103, 'epoch': 2.19} [WARNING|modeling_utils.py:388] 2022-03-22 19:43:11,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3473, 'learning_rate': 0.00021797687861271675, 'epoch': 2.19} [WARNING|modeling_utils.py:388] 2022-03-22 19:43:11,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 19:43:35,370 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.36, 'learning_rate': 0.00021780346820809247, 'epoch': 2.19} [WARNING|modeling_utils.py:388] 2022-03-22 19:43:35,370 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 19:43:47,752 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3494, 'learning_rate': 0.00021763005780346822, 'epoch': 2.19} [WARNING|modeling_bart.py:1051] 2022-03-22 19:43:47,752 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:43:55,786 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:00,228 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3341, 'learning_rate': 0.0002174566473988439, 'epoch': 2.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:06,334 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:44:10,282 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3583, 'learning_rate': 0.00021728323699421963, 'epoch': 2.2}
[WARNING|modeling_utils.py:388] 2022-03-22 19:44:16,209 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:44:18,490 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:22,714 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 19:44:26,433 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 19:44:28,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2711, 'learning_rate': 0.0002169364161849711, 'epoch': 2.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:32,440 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:34,541 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:36,589 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:38,685 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:40,671 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:42,595 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:44,452 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:46,372 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:48,179 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:49,978 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:51,716 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:53,543 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:56,925 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:44:58,516 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:00,149 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:03,106 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:04,492 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:07,261 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:10,187 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:12,899 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:15,258 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:17,492 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:19,520 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:21,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:23,342 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:25,169 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:27,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3912, 'learning_rate': 0.0002152023121387283, 'epoch': 2.22}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:30,774 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:34,358 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:37,928 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:41,456 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.052, 'learning_rate': 0.00021502890173410403, 'epoch': 2.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:45,112 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:48,657 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:52,081 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:55,541 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.7914, 'learning_rate': 0.00021485549132947972, 'epoch': 2.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 19:45:59,044 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:02,476 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:05,919 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:09,321 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:12,815 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:16,126 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 19:46:19,513 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5849, 'learning_rate': 0.0002145086705202312, 'epoch': 2.23}
{'loss': 0.4353, 'learning_rate': 0.0002143352601156069, 'epoch': 2.24}
{'loss': 0.5717, 'learning_rate': 0.00021416184971098263, 'epoch': 2.24}
45%|██████████████████████████████████ | 999/2230 [3:14:27<4:22:23, 12.79s/it]
{'loss': 0.5272, 'learning_rate': 0.00021398843930635838, 'epoch': 2.24}
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 19:47:19,110 >> Num examples = 2642 | 999/2230 [3:14:27<4:22:23, 12.79s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
03/22/2022 19:56:49 - INFO - datasets.metric - Removing /home/sanchit_huggingface_co/.cache/huggingface/metrics/wer/default/default_experiment-1-0.arrow
{'eval_loss': 0.4856905937194824, 'eval_wer': 0.14488298294327648, 'eval_runtime': 570.5354, 'eval_samples_per_second': 4.631, 'eval_steps_per_second': 0.58, 'epoch': 2.24}
{'loss': 0.4389, 'learning_rate': 0.0002136416184971098, 'epoch': 2.24}
{'loss': 0.3892, 'learning_rate': 0.00021346820809248551, 'epoch': 2.25}
{'loss': 0.3714, 'learning_rate': 0.00021329479768786126, 'epoch': 2.25}
45%|█████████████████████████████████▎ | 1004/2230 [3:26:42<27:59:21, 82.19s/it]
{'loss': 0.4208, 'learning_rate': 0.00021312138728323698, 'epoch': 2.25}
{'loss': 0.3307, 'learning_rate': 0.0002129479768786127, 'epoch': 2.25}
{'loss': 0.2909, 'learning_rate': 0.0002127745664739884, 'epoch': 2.26}
45%|█████████████████████████████████▍ | 1007/2230 [3:27:23<12:37:13, 37.15s/it]
{'loss': 0.388, 'learning_rate': 0.00021260115606936414, 'epoch': 2.26}
{'loss': 0.3202, 'learning_rate': 0.00021242774566473987, 'epoch': 2.26}
{'loss': 0.313, 'learning_rate': 0.00021225433526011559, 'epoch': 2.26}
{'loss': 0.3205, 'learning_rate': 0.0002120809248554913, 'epoch': 2.26}
{'loss': 0.364, 'learning_rate': 0.00021190751445086705, 'epoch': 2.27}
{'loss': 0.4024, 'learning_rate': 0.00021173410404624275, 'epoch': 2.27}
{'loss': 0.3499, 'learning_rate': 0.00021156069364161847, 'epoch': 2.27}
{'loss': 0.3411, 'learning_rate': 0.0002113872832369942, 'epoch': 2.27}
{'loss': 0.3433, 'learning_rate': 0.00021121387283236994, 'epoch': 2.28}
{'loss': 0.2937, 'learning_rate': 0.00021104046242774566, 'epoch': 2.28}
{'loss': 0.3456, 'learning_rate': 0.00021086705202312135, 'epoch': 2.28}
{'loss': 0.2884, 'learning_rate': 0.00021069364161849707, 'epoch': 2.28}
46%|██████████████████████████████████▎ | 1019/2230 [3:29:54<4:11:00, 12.44s/it]
{'loss': 0.3717, 'learning_rate': 0.00021052023121387282, 'epoch': 2.28}
{'loss': 0.3088, 'learning_rate': 0.00021034682080924854, 'epoch': 2.29}
{'loss': 0.2876, 'learning_rate': 0.00021017341040462426, 'epoch': 2.29}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:02:58,543 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
46%|██████████████████████████████████▎ | 1022/2230 [3:30:29<3:59:04, 11.87s/it]
{'loss': 0.3473, 'learning_rate': 0.00020999999999999998, 'epoch': 2.29}
[WARNING|modeling_utils.py:388] 2022-03-22 20:03:14,767 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3416, 'learning_rate': 0.00020982658959537573, 'epoch': 2.29}
46%|██████████████████████████████████▍ | 1024/2230 [3:30:52<3:53:47, 11.63s/it]
{'loss': 0.2797, 'learning_rate': 0.00020965317919075142, 'epoch': 2.3}
[WARNING|modeling_utils.py:388] 2022-03-22 20:03:37,317 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2986, 'learning_rate': 0.00020947976878612714, 'epoch': 2.3}
[WARNING|modeling_utils.py:388] 2022-03-22 20:03:55,508 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
46%|██████████████████████████████████▌ | 1027/2230 [3:31:26<3:46:56, 11.32s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:05,900 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
46%|██████████████████████████████████▌ | 1028/2230 [3:31:37<3:41:15, 11.04s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:04:18,354 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
46%|██████████████████████████████████▌ | 1029/2230 [3:31:47<3:35:52, 10.79s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:26,445 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:32,458 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2575, 'learning_rate': 0.00020861271676300575, 'epoch': 2.31}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:04:40,330 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:44,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 20:04:48,465 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:04:50,702 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
46%|██████████████████████████████████▋ | 1032/2230 [3:32:15<3:16:33, 9.84s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:54,406 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:56,497 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:04:58,581 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:00,681 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:02,798 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:04,769 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:06,746 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:08,685 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:10,672 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:12,511 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:14,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:16,131 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:17,958 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:19,701 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:22,978 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:24,627 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:26,213 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:29,138 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:30,602 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:34,711 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:37,396 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:38,582 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:40,832 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:43,009 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:44,930 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:46,835 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:49,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:51,205 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:51,940 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:55,758 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:05:59,380 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:03,009 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:06,575 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:10,208 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:13,776 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:17,289 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:20,804 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:24,364 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:27,835 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:31,325 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:06:31,325 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:34,744 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:34,744 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:38,255 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:41,713 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:41,713 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:06:41,713 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
47%|███████████████████████████████████▏ | 1046/2230 [3:34:11<3:51:46, 11.75s/it]
{'loss': 0.5231, 'learning_rate': 0.0002058381502890173, 'epoch': 2.35}
{'loss': 0.5203, 'learning_rate': 0.00020566473988439305, 'epoch': 2.35}
{'loss': 0.4873, 'learning_rate': 0.00020549132947976877, 'epoch': 2.35}
47%|███████████████████████████████████▎ | 1049/2230 [3:34:52<4:15:43, 12.99s/it]
{'loss': 0.4232, 'learning_rate': 0.0002053179190751445, 'epoch': 2.35}
{'loss': 0.4999, 'learning_rate': 0.00020514450867052021, 'epoch': 2.35}
{'loss': 0.4522, 'learning_rate': 0.00020497109826589596, 'epoch': 2.36}
{'loss': 0.2964, 'learning_rate': 0.00020479768786127166, 'epoch': 2.36}
{'loss': 0.3664, 'learning_rate': 0.00020462427745664738, 'epoch': 2.36}
{'loss': 0.3669, 'learning_rate': 0.0002044508670520231, 'epoch': 2.36}
{'loss': 0.3585, 'learning_rate': 0.00020427745664739885, 'epoch': 2.37}
{'loss': 0.4454, 'learning_rate': 0.00020410404624277457, 'epoch': 2.37}
{'loss': 0.348, 'learning_rate': 0.00020393063583815026, 'epoch': 2.37}
{'loss': 0.4164, 'learning_rate': 0.00020375722543352598, 'epoch': 2.37}
{'loss': 0.324, 'learning_rate': 0.00020358381502890173, 'epoch': 2.37}
{'loss': 0.3249, 'learning_rate': 0.00020341040462427745, 'epoch': 2.38}
{'loss': 0.3287, 'learning_rate': 0.00020323699421965317, 'epoch': 2.38}
{'loss': 0.323, 'learning_rate': 0.00020306358381502886, 'epoch': 2.38}
{'loss': 0.3061, 'learning_rate': 0.0002028901734104046, 'epoch': 2.38}
{'loss': 0.3294, 'learning_rate': 0.00020271676300578033, 'epoch': 2.39}
{'loss': 0.2192, 'learning_rate': 0.00020254335260115605, 'epoch': 2.39}
{'loss': 0.275, 'learning_rate': 0.00020236994219653177, 'epoch': 2.39}
{'loss': 0.282, 'learning_rate': 0.00020219653179190752, 'epoch': 2.39}
48%|███████████████████████████████████▉ | 1068/2230 [3:38:56<3:56:32, 12.21s/it]
{'loss': 0.2662, 'learning_rate': 0.00020202312138728321, 'epoch': 2.39}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:11:41,504 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it]
{'loss': 0.2716, 'learning_rate': 0.00020184971098265893, 'epoch': 2.4}
48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3011, 'learning_rate': 0.00020167630057803466, 'epoch': 2.4} 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2728, 'learning_rate': 0.0002015028901734104, 'epoch': 2.4} 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 48%|███████████████████████████████████▉ | 1069/2230 [3:39:08<3:53:36, 12.07s/it] Setting `use_cache=False`...e computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:12:18,130 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:12:18,130 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 19:17:47,849 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.3624, 'learning_rate': 0.00020132947976878612, 'epoch': 2.4}
{'loss': 0.2688, 'learning_rate': 0.00020115606936416184, 'epoch': 2.41}
[WARNING|modeling_utils.py:388] 2022-03-22 20:12:36,599 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:12:40,765 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2764, 'learning_rate': 0.00020098265895953754, 'epoch': 2.41}
{'loss': 0.2469, 'learning_rate': 0.00020080924855491329, 'epoch': 2.41}
{'loss': 0.2857, 'learning_rate': 0.000200635838150289, 'epoch': 2.41}
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:11,229 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3131, 'learning_rate': 0.00020046242774566473, 'epoch': 2.41}
48%|████████████████████████████████████▎ | 1078/2230 [3:40:50<3:32:40, 11.08s/it]
{'loss': 0.2928, 'learning_rate': 0.00020028901734104045, 'epoch': 2.42}
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:33,621 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
48%|████████████████████████████████████▎ | 1079/2230 [3:41:00<3:28:49, 10.89s/it]
{'loss': 0.3083, 'learning_rate': 0.0002001156069364162, 'epoch': 2.42}
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:43,806 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
48%|████████████████████████████████████▎ | 1080/2230 [3:41:11<3:23:46, 10.63s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:50,040 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:53,685 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
48%|████████████████████████████████████▎ | 1081/2230 [3:41:20<3:18:34, 10.37s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:13:59,732 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:14:02,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:06,223 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3392, 'learning_rate': 0.00019959537572254333, 'epoch': 2.43}
[WARNING|modeling_utils.py:388] 2022-03-22 20:14:10,060 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:14:12,232 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:14:14,353 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:14:16,504 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3088, 'learning_rate': 0.00019942196531791905, 'epoch': 2.43}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:20,325 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:22,363 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▍ | 1084/2230 [3:41:47<2:57:10, 9.28s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:24,434 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:26,469 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:28,376 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:30,237 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▍ | 1085/2230 [3:41:55<2:48:17, 8.82s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:32,128 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:33,912 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:35,635 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▌ | 1086/2230 [3:42:02<2:38:03, 8.29s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:40,735 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:42,308 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▌ | 1087/2230 [3:42:08<2:26:42, 7.70s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:46,816 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:51,027 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▌ | 1088/2230 [3:42:15<2:22:52, 7.51s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:52,433 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:54,806 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▋ | 1089/2230 [3:42:20<2:07:16, 6.69s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:14:58,196 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▋ | 1090/2230 [3:42:24<1:52:41, 5.93s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:03,070 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▋ | 1091/2230 [3:42:28<1:40:04, 5.27s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:04,925 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:06,541 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
49%|████████████████████████████████████▋ | 1092/2230 [3:42:31<1:28:22, 4.66s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:09,120 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3742, 'learning_rate': 0.00019786127167630056, 'epoch': 2.45}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:12,759 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:16,340 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:19,879 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 49%|████████████████████████████████████▊ | 1093/2230 [3:42:45<2:24:10, 7.61s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:23,519 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:26,980 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:30,428 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:33,860 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 49%|████████████████████████████████████▊ | 1094/2230 [3:42:59<2:59:52, 9.50s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:37,415 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:40,895 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:44,358 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:47,853 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 49%|████████████████████████████████████▊ | 1095/2230 [3:43:13<3:25:11, 10.85s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:54,750 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:15:58,153 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.4765, 'learning_rate': 0.00019716763005780345, 'epoch': 2.46}
{'loss': 0.4759, 'learning_rate': 0.00019699421965317917, 'epoch': 2.46}
{'loss': 0.4384, 'learning_rate': 0.0001968208092485549, 'epoch': 2.46}
{'loss': 0.4113, 'learning_rate': 0.0001966473988439306, 'epoch': 2.46}
{'loss': 0.4254, 'learning_rate': 0.00019647398843930636, 'epoch': 2.47}
{'loss': 0.361, 'learning_rate': 0.00019630057803468208, 'epoch': 2.47}
 49%|█████████████████████████████████████ | 1102/2230 [3:44:50<4:13:31, 13.49s/it]
{'loss': 0.401, 'learning_rate': 0.00019612716763005777, 'epoch': 2.47}
{'loss': 0.3375, 'learning_rate': 0.0001959537572254335, 'epoch': 2.47}
{'loss': 0.3193, 'learning_rate': 0.00019578034682080924, 'epoch': 2.48}
 50%|█████████████████████████████████████▏ | 1105/2230 [3:45:29<4:07:06, 13.18s/it]
{'loss': 0.3724, 'learning_rate': 0.00019543352601156068, 'epoch': 2.48}
{'loss': 0.2624, 'learning_rate': 0.00019526011560693637, 'epoch': 2.48}
{'loss': 0.3209, 'learning_rate': 0.00019508670520231212, 'epoch': 2.48}
{'loss': 0.3297, 'learning_rate': 0.00019491329479768784, 'epoch': 2.49}
{'loss': 0.2509, 'learning_rate': 0.00019473988439306356, 'epoch': 2.49}
{'loss': 0.2808, 'learning_rate': 0.00019456647398843928, 'epoch': 2.49}
{'loss': 0.3131, 'learning_rate': 0.00019439306358381503, 'epoch': 2.49}
{'loss': 0.301, 'learning_rate': 0.00019421965317919073, 'epoch': 2.5}
{'loss': 0.2962, 'learning_rate': 0.00019404624277456645, 'epoch': 2.5}
{'loss': 0.2827, 'learning_rate': 0.00019387283236994217, 'epoch': 2.5}
{'loss': 0.2973, 'learning_rate': 0.00019369942196531792, 'epoch': 2.5}
{'loss': 0.2817, 'learning_rate': 0.00019352601156069364, 'epoch': 2.5}
50%|█████████████████████████████████████▌ | 1118/2230 [3:48:12<3:47:07, 12.25s/it]
{'loss': 0.253, 'learning_rate': 0.00019317919075144505, 'epoch': 2.51}
{'loss': 0.2817, 'learning_rate': 0.0001930057803468208, 'epoch': 2.51}
[WARNING|modeling_utils.py:388] 2022-03-22 20:21:24,746 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2823, 'learning_rate': 0.00019283236994219652, 'epoch': 2.51}
{'loss': 0.3166, 'learning_rate': 0.00019265895953757224, 'epoch': 2.52}
50%|█████████████████████████████████████▊ | 1123/2230 [3:49:11<3:34:40, 11.64s/it]
{'loss': 0.3035, 'learning_rate': 0.00019248554913294796, 'epoch': 2.52}
{'loss': 0.2608, 'learning_rate': 0.0001923121387283237, 'epoch': 2.52}
{'loss': 0.3112, 'learning_rate': 0.0001921387283236994, 'epoch': 2.52}
[WARNING|modeling_utils.py:388] 2022-03-22 20:22:18,081 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:22:22,057 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3573, 'learning_rate': 0.00019196531791907512, 'epoch': 2.52}
{'loss': 0.2999, 'learning_rate': 0.00019179190751445084, 'epoch': 2.53}
[WARNING|modeling_utils.py:388] 2022-03-22 20:22:44,203 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:22:54,476 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:00,676 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.243, 'learning_rate': 0.000191271676300578, 'epoch': 2.53}
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:06,771 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:09,108 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
51%|██████████████████████████████████████ | 1131/2230 [3:50:36<3:07:02, 10.21s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:15,043 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:17,350 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:19,578 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:25,129 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:27,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:29,282 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:31,357 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:33,288 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:35,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:37,068 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:38,977 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:40,771 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:42,546 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:42,546 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:44,274 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:46,089 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:49,359 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:50,945 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:50,945 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:52,592 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:23:55,581 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:57,029 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:57,029 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:58,550 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:23:59,894 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:02,734 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:02,734 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:24:05,440 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:07,852 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:07,852 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:09,031 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:11,355 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:13,456 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:13,456 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:24:15,466 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:17,270 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:17,270 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:19,936 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:19,936 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:20,680 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:22,971 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:24:22,971 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:26,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:26,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:30,195 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:33,747 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:33,747 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:33,747 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:24:37,391 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:37,391 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:40,982 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:40,982 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:44,501 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:47,987 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:47,987 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 20:24:47,987 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:51,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:51,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:55,100 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:58,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:24:58,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:25:02,080 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.4007, 'learning_rate': 0.00018867052023121387, 'epoch': 2.57}
[WARNING|modeling_utils.py:388] 2022-03-22 20:25:05,533 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:25:08,930 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:25:12,345 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
 51%|██████████████████████████████████████▌ | 1146/2230 [3:52:40<3:32:16, 11.75s/it]
{'loss': 0.3683, 'learning_rate': 0.0001884971098265896, 'epoch': 2.57}
{'loss': 0.3969, 'learning_rate': 0.00018832369942196528, 'epoch': 2.57}
{'loss': 0.386, 'learning_rate': 0.00018815028901734103, 'epoch': 2.57}
{'loss': 0.3589, 'learning_rate': 0.00018797687861271675, 'epoch': 2.58}
{'loss': 0.3318, 'learning_rate': 0.00018780346820809247, 'epoch': 2.58}
{'loss': 0.3108, 'learning_rate': 0.0001876300578034682, 'epoch': 2.58}
 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]
52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.368, 'learning_rate': 0.00018745664739884394, 'epoch': 2.58} 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2717, 'learning_rate': 0.00018728323699421963, 'epoch': 2.59} 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3246, 'learning_rate': 0.00018710982658959536, 'epoch': 2.59} 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3484, 'learning_rate': 0.00018693641618497108, 'epoch': 2.59} 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.294, 'learning_rate': 0.00018676300578034682, 'epoch': 2.59} 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 52%|██████████████████████████████████████▋ | 1152/2230 [3:54:02<4:00:12, 13.37s/it]g-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.3197, 'learning_rate': 0.00018658959537572254, 'epoch': 2.59}
{'loss': 0.2659, 'learning_rate': 0.00018641618497109824, 'epoch': 2.6}
{'loss': 0.2214, 'learning_rate': 0.00018624277456647396, 'epoch': 2.6}
{'loss': 0.2667, 'learning_rate': 0.0001860693641618497, 'epoch': 2.6}
{'loss': 0.2771, 'learning_rate': 0.00018589595375722543, 'epoch': 2.6}
52%|███████████████████████████████████████ | 1162/2230 [3:56:09<3:43:15, 12.54s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:28:51,071 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
52%|███████████████████████████████████████ | 1163/2230 [3:56:24<3:52:17, 13.06s/it]
52%|███████████████████████████████████████▏ | 1164/2230 [3:56:36<3:47:43, 12.82s/it]
52%|███████████████████████████████████████▏ | 1165/2230 [3:56:48<3:43:59, 12.62s/it]
{'loss': 0.286, 'learning_rate': 0.00018502890173410403, 'epoch': 2.61}
{'loss': 0.2913, 'learning_rate': 0.00018485549132947975, 'epoch': 2.62}
{'loss': 0.2439, 'learning_rate': 0.0001846820809248555, 'epoch': 2.62}
52%|███████████████████████████████████████▎ | 1169/2230 [3:57:36<3:33:10, 12.06s/it]
{'loss': 0.2759, 'learning_rate': 0.0001845086705202312, 'epoch': 2.62}
{'loss': 0.2754, 'learning_rate': 0.0001843352601156069, 'epoch': 2.62}
{'loss': 0.2704, 'learning_rate': 0.00018416184971098263, 'epoch': 2.63}
53%|███████████████████████████████████████▍ | 1172/2230 [3:58:10<3:26:20, 11.70s/it]
{'loss': 0.2555, 'learning_rate': 0.00018398843930635838, 'epoch': 2.63}
{'loss': 0.2783, 'learning_rate': 0.0001838150289017341, 'epoch': 2.63}
53%|███████████████████████████████████████▍ | 1174/2230 [3:58:33<3:22:40, 11.52s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:31:13,020 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3144, 'learning_rate': 0.00018346820809248552, 'epoch': 2.63}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:31:29,521 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2811, 'learning_rate': 0.00018329479768786124, 'epoch': 2.64}
[WARNING|modeling_utils.py:388] 2022-03-22 20:31:43,269 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2591, 'learning_rate': 0.00018312138728323698, 'epoch': 2.64}
[WARNING|modeling_utils.py:388] 2022-03-22 20:31:53,774 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2926, 'learning_rate': 0.0001829479768786127, 'epoch': 2.64}
[WARNING|modeling_utils.py:388] 2022-03-22 20:31:57,763 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
53%|███████████████████████████████████████▋ | 1179/2230 [3:59:28<3:10:02, 10.85s/it] {'loss': 0.3121, 'learning_rate': 0.00018277456647398843, 'epoch': 2.64} [WARNING|modeling_utils.py:388] 2022-03-22 20:32:11,690 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:32:15,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 20:32:17,833 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 20:32:20,207 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.2272, 'learning_rate': 0.00018242774566473987, 'epoch': 2.65} [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:28,091 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:32:31,960 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 20:32:34,233 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.1928, 'learning_rate': 0.0001822543352601156, 'epoch': 2.65} [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:38,284 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:40,399 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:42,546 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:32:44,665 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:46,720 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:48,722 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:50,683 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:52,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:32:54,604 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:56,461 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:32:58,270 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:00,139 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:01,870 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:03,591 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:33:06,975 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:08,527 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:10,052 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:13,088 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:14,463 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:33:17,271 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:18,573 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:21,257 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:22,481 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:24,874 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:33:27,005 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:29,090 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:30,907 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:33,524 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:35,031 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3035, 'learning_rate': 0.0001805202312138728, 'epoch': 2.67} [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:38,306 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:41,908 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:45,410 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:48,918 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.7377, 'learning_rate': 0.00018034682080924854, 'epoch': 2.67} [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:52,628 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:56,145 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:33:59,597 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:03,089 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.5811, 'learning_rate': 0.00018017341040462426, 'epoch': 2.68} [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:06,627 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:10,101 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:13,509 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:34:16,957 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:20,407 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:34:23,808 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.4348, 'learning_rate': 0.0001798265895953757, 'epoch': 2.68}
{'loss': 0.3972, 'learning_rate': 0.00017965317919075145, 'epoch': 2.68} {'loss': 0.4097, 'learning_rate': 0.00017947976878612715, 'epoch': 2.69}
{'loss': 0.386, 'learning_rate': 0.00017930635838150287, 'epoch': 2.69} {'loss': 0.3258, 'learning_rate': 0.0001791329479768786, 'epoch': 2.69}
{'loss': 0.3626, 'learning_rate': 0.00017895953757225434, 'epoch': 2.69} {'loss': 0.3213, 'learning_rate': 0.00017878612716763006, 'epoch': 2.7}
54%|████████████████████████████████████████▍ | 1203/2230 [4:03:30<3:47:01, 13.26s/it] {'loss': 0.2808, 'learning_rate': 0.00017861271676300575, 'epoch': 2.7} {'loss': 0.2454, 'learning_rate': 0.00017843930635838147, 'epoch': 2.7}
{'loss': 0.2311, 'learning_rate': 0.00017826589595375722, 'epoch': 2.7}
{'loss': 0.2546, 'learning_rate': 0.00017809248554913294, 'epoch': 2.7}
{'loss': 0.2878, 'learning_rate': 0.00017791907514450866, 'epoch': 2.71}
54%|████████████████████████████████████████▋ | 1208/2230 [4:04:33<3:38:18, 12.82s/it]
54%|████████████████████████████████████████▋ | 1209/2230 [4:04:46<3:36:50, 12.74s/it]
{'loss': 0.3434, 'learning_rate': 0.0001775722543352601, 'epoch': 2.71}
{'loss': 0.2801, 'learning_rate': 0.00017739884393063582, 'epoch': 2.71}
{'loss': 0.3088, 'learning_rate': 0.00017722543352601154, 'epoch': 2.72}
{'loss': 0.2821, 'learning_rate': 0.00017705202312138726, 'epoch': 2.72}
{'loss': 0.3169, 'learning_rate': 0.000176878612716763, 'epoch': 2.72}
{'loss': 0.2789, 'learning_rate': 0.0001767052023121387, 'epoch': 2.72}
54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it]
{'loss': 0.1741, 'learning_rate': 0.00017653179190751442, 'epoch': 2.72}
{'loss': 0.2652, 'learning_rate': 0.00017635838150289015, 'epoch': 2.73}
{'loss': 0.2357, 'learning_rate': 0.0001761849710982659, 'epoch': 2.73}
{'loss': 0.2809, 'learning_rate': 0.00017601156069364161, 'epoch': 2.73}
{'loss': 0.2503, 'learning_rate': 0.0001758381502890173, 'epoch': 2.73}
{'loss': 0.2239, 'learning_rate': 0.00017566473988439303, 'epoch': 2.74}
{'loss': 0.2508, 'learning_rate': 0.00017549132947976878, 'epoch': 2.74}
{'loss': 0.2548, 'learning_rate': 0.0001753179190751445, 'epoch': 2.74}
{'loss': 0.3294, 'learning_rate': 0.00017514450867052022, 'epoch': 2.74}
{'loss': 0.272, 'learning_rate': 0.00017497109826589594, 'epoch': 2.74}
54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.229, 'learning_rate': 0.00017479768786127169, 'epoch': 2.75} 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.2473, 'learning_rate': 0.00017462427745664738, 'epoch': 2.75} 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2432, 'learning_rate': 0.0001744508670520231, 'epoch': 2.75} 54%|████████████████████████████████████████▊ | 1215/2230 [4:06:05<3:44:14, 13.26s/it] Setting `use_cache=False`...e computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 20:41:15,875 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 20:15:51,390 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.3006, 'learning_rate': 0.00017427745664739882, 'epoch': 2.75}
[WARNING|modeling_utils.py:388] 2022-03-22 20:41:26,334 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
55%|█████████████████████████████████████████▎ | 1229/2230 [4:08:54<3:06:59, 11.21s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:41:34,027 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:41:40,198 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2355, 'learning_rate': 0.0001739306358381503, 'epoch': 2.76}
[WARNING|modeling_utils.py:388] 2022-03-22 20:41:46,397 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 20:41:50,738 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2419, 'learning_rate': 0.00017375722543352598, 'epoch': 2.76}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:41:56,693 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:02,556 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:42:06,363 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:42:08,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:42:10,814 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2763, 'learning_rate': 0.00017341040462427745, 'epoch': 2.76}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:14,704 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:16,837 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
55%|█████████████████████████████████████████▌ | 1234/2230 [4:09:41<2:37:06, 9.46s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:18,920 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:20,879 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:22,803 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:24,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
55%|█████████████████████████████████████████▌ | 1235/2230 [4:09:49<2:28:22, 8.95s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:26,619 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:28,502 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:30,263 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
55%|█████████████████████████████████████████▌ | 1236/2230 [4:09:56<2:19:23, 8.41s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:33,710 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:35,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:36,931 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:38,486 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
55%|█████████████████████████████████████████▌ | 1237/2230 [4:10:03<2:09:27, 7.82s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:40,106 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:41,614 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:44,602 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
56%|█████████████████████████████████████████▋ | 1238/2230 [4:10:10<2:06:59, 7.68s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:47,463 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:48,768 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:51,209 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:52,436 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:53,551 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:55,702 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:56,763 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:57,706 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:42:59,514 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:00,461 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:01,321 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
56%|█████████████████████████████████████████▊ | 1242/2230 [4:10:26<1:18:57, 4.80s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:04,732 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:08,345 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:11,978 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:15,533 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
56%|█████████████████████████████████████████▊ | 1243/2230 [4:10:41<2:06:35, 7.70s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:19,148 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:22,688 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:26,212 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:29,653 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
56%|█████████████████████████████████████████▊ | 1244/2230 [4:10:55<2:38:04, 9.62s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:33,254 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:36,748 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:40,260 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:43,745 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
56%|█████████████████████████████████████████▊ | 1245/2230 [4:11:09<2:59:56, 10.96s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3472, 'learning_rate': 0.00017115606936416185, 'epoch': 2.79} [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.3903, 'learning_rate': 0.00017098265895953757, 'epoch': 2.8} [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 20:43:50,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 20:43:47,335 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.3271, 'learning_rate': 0.00017080924855491326, 'epoch': 2.8}
{'loss': 0.3256, 'learning_rate': 0.000170635838150289, 'epoch': 2.8}
{'loss': 0.3254, 'learning_rate': 0.00017046242774566473, 'epoch': 2.8}
{'loss': 0.3303, 'learning_rate': 0.00017028901734104045, 'epoch': 2.8}
56%|██████████████████████████████████████████ | 1252/2230 [4:12:46<3:40:27, 13.53s/it]
{'loss': 0.3209, 'learning_rate': 0.00017011560693641617, 'epoch': 2.81}
{'loss': 0.27, 'learning_rate': 0.00016994219653179192, 'epoch': 2.81}
{'loss': 0.2704, 'learning_rate': 0.0001697687861271676, 'epoch': 2.81}
56%|██████████████████████████████████████████▏ | 1255/2230 [4:13:25<3:33:48, 13.16s/it]
{'loss': 0.3389, 'learning_rate': 0.00016959537572254333, 'epoch': 2.81}
{'loss': 0.254, 'learning_rate': 0.00016942196531791905, 'epoch': 2.82}
{'loss': 0.2372, 'learning_rate': 0.0001692485549132948, 'epoch': 2.82}
{'loss': 0.2977, 'learning_rate': 0.00016907514450867052, 'epoch': 2.82}
56%|██████████████████████████████████████████▎ | 1259/2230 [4:14:16<3:27:13, 12.81s/it]
{'loss': 0.3098, 'learning_rate': 0.00016890173410404622, 'epoch': 2.82}
57%|██████████████████████████████████████████▍ | 1260/2230 [4:14:28<3:25:42, 12.72s/it]
{'loss': 0.2283, 'learning_rate': 0.00016872832369942194, 'epoch': 2.83}
{'loss': 0.2234, 'learning_rate': 0.00016855491329479768, 'epoch': 2.83}
{'loss': 0.2372, 'learning_rate': 0.0001683815028901734, 'epoch': 2.83}
{'loss': 0.2552, 'learning_rate': 0.00016820809248554913, 'epoch': 2.83}
{'loss': 0.2208, 'learning_rate': 0.00016803468208092482, 'epoch': 2.83}
{'loss': 0.2961, 'learning_rate': 0.00016786127167630057, 'epoch': 2.84}
{'loss': 0.2374, 'learning_rate': 0.0001676878612716763, 'epoch': 2.84}
57%|██████████████████████████████████████████▌ | 1267/2230 [4:15:56<3:18:36, 12.37s/it]
{'loss': 0.2974, 'learning_rate': 0.000167514450867052, 'epoch': 2.84}
{'loss': 0.2306, 'learning_rate': 0.00016734104046242773, 'epoch': 2.84}
{'loss': 0.2881, 'learning_rate': 0.00016716763005780348, 'epoch': 2.85}
{'loss': 0.1941, 'learning_rate': 0.0001669942196531792, 'epoch': 2.85}
57%|██████████████████████████████████████████▋ | 1271/2230 [4:16:43<3:09:16, 11.84s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:49:28,522 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2499, 'learning_rate': 0.0001666473988439306, 'epoch': 2.85}
57%|██████████████████████████████████████████▊ | 1273/2230 [4:17:06<3:05:10, 11.61s/it]
{'loss': 0.2944, 'learning_rate': 0.00016647398843930633, 'epoch': 2.85}
{'loss': 0.2228, 'learning_rate': 0.00016630057803468208, 'epoch': 2.86}
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:03,528 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:09,526 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
57%|██████████████████████████████████████████▉ | 1276/2230 [4:17:40<3:03:32, 11.54s/it]
{'loss': 0.2833, 'learning_rate': 0.0001659537572254335, 'epoch': 2.86}
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:24,062 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:28,006 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2251, 'learning_rate': 0.00016578034682080922, 'epoch': 2.86}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:50:38,034 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2508, 'learning_rate': 0.00016560693641618496, 'epoch': 2.87}
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:46,003 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2835, 'learning_rate': 0.00016543352601156068, 'epoch': 2.87}
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:52,220 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:50:58,244 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2011, 'learning_rate': 0.0001652601156069364, 'epoch': 2.87}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:51:02,726 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:06,649 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 20:51:10,917 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:14,755 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:16,973 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2861, 'learning_rate': 0.00016491329479768785, 'epoch': 2.87}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:51:20,992 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:51:23,095 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:26,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:28,602 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:30,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:32,489 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:34,483 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:36,338 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:38,175 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:39,985 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:41,791 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:45,109 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:46,741 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:48,407 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:49,959 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:53,033 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:54,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:55,965 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:51:58,792 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:01,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:02,732 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:05,083 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:07,356 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:09,463 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:11,420 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:13,246 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:15,014 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:16,556 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:18,843 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:22,511 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:26,050 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:29,569 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:33,191 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:36,696 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:40,187 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:43,708 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.4315, 'learning_rate': 0.00016283236994219652, 'epoch': 2.9}
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:47,243 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:50,700 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:54,199 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:52:57,593 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3455, 'learning_rate': 0.00016265895953757224, 'epoch': 2.9}
[WARNING|modeling_utils.py:388] 2022-03-22 20:53:01,117 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:53:04,563 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:53:07,967 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3502, 'learning_rate': 0.00016248554913294796, 'epoch': 2.91}
58%|███████████████████████████████████████████▌ | 1297/2230 [4:20:50<3:11:42, 12.33s/it]
{'loss': 0.3384, 'learning_rate': 0.00016231213872832368, 'epoch': 2.91}
{'loss': 0.3879, 'learning_rate': 0.00016213872832369943, 'epoch': 2.91}
{'loss': 0.3678, 'learning_rate': 0.00016196531791907512, 'epoch': 2.91}
{'loss': 0.3088, 'learning_rate': 0.00016179190751445085, 'epoch': 2.91}
{'loss': 0.2579, 'learning_rate': 0.00016161849710982657, 'epoch': 2.92}
{'loss': 0.2767, 'learning_rate': 0.00016144508670520231, 'epoch': 2.92}
{'loss': 0.2344, 'learning_rate': 0.00016127167630057803, 'epoch': 2.92}
{'loss': 0.2729, 'learning_rate': 0.00016109826589595373, 'epoch': 2.92}
{'loss': 0.2734, 'learning_rate': 0.00016092485549132945, 'epoch': 2.93}
{'loss': 0.2517, 'learning_rate': 0.0001607514450867052, 'epoch': 2.93}
{'loss': 0.228, 'learning_rate': 0.00016057803468208092, 'epoch': 2.93}
{'loss': 0.2728, 'learning_rate': 0.00016040462427745664, 'epoch': 2.93}
59%|████████████████████████████████████████████ | 1309/2230 [4:23:28<3:14:34, 12.68s/it]
59%|████████████████████████████████████████████ | 1310/2230 [4:23:40<3:12:52, 12.58s/it]
59%|████████████████████████████████████████████ | 1311/2230 [4:23:52<3:11:38, 12.51s/it]
59%|████████████████████████████████████████████▏ | 1312/2230 [4:24:05<3:10:32, 12.45s/it]
{'loss': 0.15, 'learning_rate': 0.00015971098265895952, 'epoch': 2.94}
59%|████████████████████████████████████████████▏ | 1313/2230 [4:24:19<3:17:34, 12.93s/it]
{'loss': 0.2304, 'learning_rate': 0.00015953757225433524, 'epoch': 2.94}
59%|████████████████████████████████████████████▏ | 1314/2230 [4:24:31<3:14:38, 12.75s/it]
{'loss': 0.2228, 'learning_rate': 0.000159364161849711, 'epoch': 2.95}
{'loss': 0.2411, 'learning_rate': 0.00015919075144508668, 'epoch': 2.95}
{'loss': 0.2497, 'learning_rate': 0.0001590173410404624, 'epoch': 2.95}
{'loss': 0.2196, 'learning_rate': 0.00015884393063583812, 'epoch': 2.95}
59%|████████████████████████████████████████████▏ | 1314/2230 [4:24:31<3:14:38, 12.75s/it]
{'loss': 0.2327, 'learning_rate': 0.00015867052023121387, 'epoch': 2.96}
59%|████████████████████████████████████████████▎ | 1319/2230 [4:25:30<3:01:03, 11.93s/it]
59%|████████████████████████████████████████████▎ | 1319/2230 [4:25:30<3:01:03, 11.93s/it]
{'loss': 0.2496, 'learning_rate': 0.0001584971098265896, 'epoch': 2.96}
{'loss': 0.2477, 'learning_rate': 0.0001583236994219653, 'epoch': 2.96}
59%|████████████████████████████████████████████▎ | 1319/2230 [4:25:30<3:01:03, 11.93s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 20:58:30,044 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3474, 'learning_rate': 0.000158150289017341, 'epoch': 2.96}
[WARNING|modeling_utils.py:388] 2022-03-22 20:58:30,044 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2032, 'learning_rate': 0.00015797687861271675, 'epoch': 2.96}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:58:50,529 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:58:50,529 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2555, 'learning_rate': 0.00015780346820809248, 'epoch': 2.97}
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:00,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2592, 'learning_rate': 0.0001576300578034682, 'epoch': 2.97}
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:00,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2183, 'learning_rate': 0.00015745664739884392, 'epoch': 2.97}
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:20,786 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2331, 'learning_rate': 0.00015728323699421966, 'epoch': 2.97}
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:31,048 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:31,048 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1846, 'learning_rate': 0.00015710982658959536, 'epoch': 2.98}
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:41,093 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:43,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 20:59:43,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2293, 'learning_rate': 0.00015693641618497108, 'epoch': 2.98}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:59:51,427 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 20:59:53,740 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
60%|████████████████████████████████████████████▋ | 1329/2230 [4:27:18<2:33:32, 10.22s/it]
{'loss': 0.194, 'learning_rate': 0.0001567630057803468, 'epoch': 2.98}
[WARNING|modeling_bart.py:1051] 2022-03-22 20:59:59,429 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:00:01,611 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:00:03,760 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2433, 'learning_rate': 0.00015658959537572255, 'epoch': 2.98}
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:07,325 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:09,350 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:11,319 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:13,320 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:15,226 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:17,076 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:18,892 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:20,756 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:22,495 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:25,836 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:27,526 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:29,080 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:32,013 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:33,450 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:36,071 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:37,262 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:39,653 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:41,757 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:43,769 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:45,579 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:47,348 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:50,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3316, 'learning_rate': 0.0001552023121387283, 'epoch': 3.0}
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:52,875 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:00:56,503 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:00,044 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:03,587 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3798, 'learning_rate': 0.00015502890173410403, 'epoch': 3.0}
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:07,180 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:10,600 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:10,600 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:14,097 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:17,547 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3203, 'learning_rate': 0.00015485549132947975, 'epoch': 3.0}
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:21,028 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:24,524 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:24,524 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:27,956 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:31,382 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:01:34,879 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2678, 'learning_rate': 0.00015450867052023122, 'epoch': 3.01}
{'loss': 0.2495, 'learning_rate': 0.00015433526011560692, 'epoch': 3.01}
{'loss': 0.2213, 'learning_rate': 0.00015416184971098264, 'epoch': 3.01}
{'loss': 0.2451, 'learning_rate': 0.00015398843930635836, 'epoch': 3.02}
{'loss': 0.2073, 'learning_rate': 0.0001538150289017341, 'epoch': 3.02}
{'loss': 0.205, 'learning_rate': 0.00015364161849710983, 'epoch': 3.02}
{'loss': 0.2282, 'learning_rate': 0.00015346820809248555, 'epoch': 3.02}
{'loss': 0.1877, 'learning_rate': 0.00015329479768786124, 'epoch': 3.02}
{'loss': 0.1867, 'learning_rate': 0.000153121387283237, 'epoch': 3.03}
{'loss': 0.1768, 'learning_rate': 0.0001529479768786127, 'epoch': 3.03}
61%|█████████████████████████████████████████████▍ | 1352/2230 [4:31:23<3:13:04, 13.19s/it]
{'loss': 0.1304, 'learning_rate': 0.00015277456647398843, 'epoch': 3.03}
{'loss': 0.1706, 'learning_rate': 0.00015260115606936415, 'epoch': 3.03}
{'loss': 0.2271, 'learning_rate': 0.0001524277456647399, 'epoch': 3.04}
{'loss': 0.1508, 'learning_rate': 0.0001522543352601156, 'epoch': 3.04}
{'loss': 0.1696, 'learning_rate': 0.0001520809248554913, 'epoch': 3.04}
{'loss': 0.1688, 'learning_rate': 0.00015190751445086703, 'epoch': 3.04}
61%|█████████████████████████████████████████████▋ | 1358/2230 [4:32:38<3:02:15, 12.54s/it]
{'loss': 0.2154, 'learning_rate': 0.00015173410404624278, 'epoch': 3.04}
61%|█████████████████████████████████████████████▋ | 1359/2230 [4:32:51<3:01:21, 12.49s/it]
{'loss': 0.1489, 'learning_rate': 0.0001515606936416185, 'epoch': 3.05}
61%|█████████████████████████████████████████████▋ | 1360/2230 [4:33:03<2:59:49, 12.40s/it]
{'loss': 0.1893, 'learning_rate': 0.0001513872832369942, 'epoch': 3.05}
{'loss': 0.188, 'learning_rate': 0.00015121387283236992, 'epoch': 3.05}
61%|█████████████████████████████████████████████▊ | 1362/2230 [4:33:27<2:57:38, 12.28s/it]
{'loss': 0.172, 'learning_rate': 0.00015104046242774566, 'epoch': 3.05}
{'loss': 0.1528, 'learning_rate': 0.00015086705202312138, 'epoch': 3.06}
[WARNING|modeling_utils.py:388] 2022-03-22 21:06:25,579 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:06:31,549 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1574, 'learning_rate': 0.0001506936416184971, 'epoch': 3.06}
[WARNING|modeling_utils.py:388] 2022-03-22 21:06:41,738 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1721, 'learning_rate': 0.0001505202312138728, 'epoch': 3.06}
61%|█████████████████████████████████████████████▉ | 1366/2230 [4:34:16<2:54:26, 12.11s/it]
{'loss': 0.1366, 'learning_rate': 0.00015034682080924855, 'epoch': 3.06}
{'loss': 0.1438, 'learning_rate': 0.00015017341040462427, 'epoch': 3.07}
[WARNING|modeling_utils.py:388] 2022-03-22 21:07:16,457 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1591, 'learning_rate': 0.00015, 'epoch': 3.07}
61%|██████████████████████████████████████████████ | 1369/2230 [4:34:51<2:47:19, 11.66s/it]
{'loss': 0.1695, 'learning_rate': 0.0001498265895953757, 'epoch': 3.07}
{'loss': 0.1676, 'learning_rate': 0.00014965317919075143, 'epoch': 3.07}
{'loss': 0.1333, 'learning_rate': 0.00014947976878612715, 'epoch': 3.07}
62%|██████████████████████████████████████████████▏ | 1372/2230 [4:35:24<2:40:15, 11.21s/it]
{'loss': 0.1445, 'learning_rate': 0.0001491329479768786, 'epoch': 3.08}
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:19,389 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1443, 'learning_rate': 0.0001489595375722543, 'epoch': 3.08}
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:25,841 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:08:31,685 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1344, 'learning_rate': 0.00014878612716763003, 'epoch': 3.08}
{'loss': 0.1514, 'learning_rate': 0.00014861271676300578, 'epoch': 3.09}
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:47,549 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:49,887 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
62%|██████████████████████████████████████████████▎ | 1377/2230 [4:36:16<2:28:18, 10.43s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:55,784 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:08:58,048 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:09:00,275 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:04,410 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:06,565 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:08,680 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.4735, 'learning_rate': 0.00014809248554913291, 'epoch': 3.09}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:13,939 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:15,939 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:17,947 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▍ | 1380/2230 [4:36:42<2:09:50, 9.17s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:19,951 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:21,838 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:23,723 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:25,572 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▍ | 1381/2230 [4:36:50<2:02:49, 8.68s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:27,449 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:29,239 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:32,655 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▍ | 1382/2230 [4:36:57<1:55:20, 8.16s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:34,370 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:35,984 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:37,528 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▌ | 1383/2230 [4:37:03<1:47:42, 7.63s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:40,689 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:42,148 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:43,529 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▌ | 1384/2230 [4:37:09<1:39:14, 7.04s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:46,279 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:47,530 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:49,961 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:51,197 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:52,349 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:54,527 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:55,611 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:56,595 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:58,405 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:09:59,349 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:00,201 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:02,454 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▋ | 1388/2230 [4:37:27<1:11:03, 5.06s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:05,104 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:08,806 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:12,394 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:15,984 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▋ | 1389/2230 [4:37:41<1:51:12, 7.93s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:19,564 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:22,995 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:26,505 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:30,066 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▋ | 1390/2230 [4:37:55<2:16:53, 9.78s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:33,645 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:37,115 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:40,564 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:43,914 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
62%|██████████████████████████████████████████████▊ | 1391/2230 [4:38:09<2:33:19, 10.96s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:10:50,734 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2823, 'learning_rate': 0.00014583815028901734, 'epoch': 3.12}
{'loss': 0.2574, 'learning_rate': 0.00014566473988439306, 'epoch': 3.12}
{'loss': 0.2407, 'learning_rate': 0.00014549132947976878, 'epoch': 3.13}
63%|██████████████████████████████████████████████▉ | 1395/2230 [4:39:03<2:59:16, 12.88s/it]
{'loss': 0.2446, 'learning_rate': 0.0001453179190751445, 'epoch': 3.13}
{'loss': 0.2939, 'learning_rate': 0.00014514450867052022, 'epoch': 3.13}
63%|██████████████████████████████████████████████▉ | 1397/2230 [4:39:30<3:02:07, 13.12s/it]
{'loss': 0.1817, 'learning_rate': 0.00014497109826589594, 'epoch': 3.13}
{'loss': 0.2134, 'learning_rate': 0.00014479768786127166, 'epoch': 3.13}
{'loss': 0.177, 'learning_rate': 0.00014462427745664738, 'epoch': 3.14}
{'loss': 0.2093, 'learning_rate': 0.0001444508670520231, 'epoch': 3.14}
{'loss': 0.204, 'learning_rate': 0.00014427745664739882, 'epoch': 3.14}
{'loss': 0.1778, 'learning_rate': 0.00014410404624277454, 'epoch': 3.14}
63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it]
{'loss': 0.1262, 'learning_rate': 0.00014393063583815026, 'epoch': 3.15}
{'loss': 0.1809, 'learning_rate': 0.000143757225433526, 'epoch': 3.15}
63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2144, 'learning_rate': 0.0001435838150289017, 'epoch': 3.15} 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.208, 'learning_rate': 0.00014341040462427745, 'epoch': 3.15} 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1815, 'learning_rate': 0.00014323699421965317, 'epoch': 3.15} 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... {'loss': 0.1736, 'learning_rate': 0.0001430635838150289, 'epoch': 3.16} 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▏ | 1403/2230 [4:40:50<3:01:10, 13.14s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▍ | 1409/2230 [4:42:05<2:51:58, 12.57s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 63%|███████████████████████████████████████████████▍ | 1409/2230 [4:42:05<2:51:58, 12.57s/it] Setting `use_cache=False`...1] 2022-03-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.1365, 'learning_rate': 0.00014289017341040462, 'epoch': 3.16}
63%|███████████████████████████████████████████████▍ | 1410/2230 [4:42:17<2:50:44, 12.49s/it]
{'loss': 0.192, 'learning_rate': 0.00014271676300578034, 'epoch': 3.16}
{'loss': 0.1793, 'learning_rate': 0.00014254335260115606, 'epoch': 3.16}
{'loss': 0.1523, 'learning_rate': 0.00014236994219653178, 'epoch': 3.17}
[WARNING|modeling_utils.py:388] 2022-03-22 21:15:35,670 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1728, 'learning_rate': 0.00014202312138728322, 'epoch': 3.17}
{'loss': 0.1174, 'learning_rate': 0.00014184971098265894, 'epoch': 3.17}
63%|███████████████████████████████████████████████▌ | 1416/2230 [4:43:31<2:44:55, 12.16s/it]
{'loss': 0.1724, 'learning_rate': 0.00014167630057803466, 'epoch': 3.17}
[WARNING|modeling_utils.py:388] 2022-03-22 21:16:22,749 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1748, 'learning_rate': 0.00014132947976878613, 'epoch': 3.18}
64%|███████████████████████████████████████████████▋ | 1419/2230 [4:44:06<2:38:33, 11.73s/it]
{'loss': 0.1524, 'learning_rate': 0.00014115606936416182, 'epoch': 3.18}
[WARNING|modeling_utils.py:388] 2022-03-22 21:16:47,034 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1267, 'learning_rate': 0.00014098265895953757, 'epoch': 3.18}
64%|███████████████████████████████████████████████▊ | 1421/2230 [4:44:28<2:34:01, 11.42s/it]
{'loss': 0.1518, 'learning_rate': 0.0001408092485549133, 'epoch': 3.19}
{'loss': 0.1754, 'learning_rate': 0.000140635838150289, 'epoch': 3.19}
[WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1313, 'learning_rate': 0.00014046242774566473, 'epoch': 3.19} [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:21,241 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:17:35,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:35,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:35,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1822, 'learning_rate': 0.00014028901734104045, 'epoch': 3.19} [WARNING|modeling_utils.py:388] 2022-03-22 21:17:35,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:35,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:46,000 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:17:46,000 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.134, 'learning_rate': 0.00014011560693641617, 'epoch': 3.2}
[WARNING|modeling_utils.py:388] 2022-03-22 21:17:54,139 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:00,329 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1302, 'learning_rate': 0.0001399421965317919, 'epoch': 3.2}
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:06,345 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:10,687 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:14,583 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:16,906 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1463, 'learning_rate': 0.00013959537572254334, 'epoch': 3.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:21,054 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:24,762 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:26,907 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:29,074 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:31,136 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:33,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:35,141 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:18:37,184 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:40,617 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:42,444 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:44,310 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:46,082 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:47,768 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:51,110 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:52,705 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:54,283 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:57,389 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:18:58,809 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:01,510 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:02,912 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:05,365 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:06,532 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:08,886 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:11,021 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:13,024 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:15,823 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:16,632 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:19,661 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1929, 'learning_rate': 0.00013786127167630057, 'epoch': 3.22}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:23,031 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:26,614 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:30,178 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:33,710 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3476, 'learning_rate': 0.0001376878612716763, 'epoch': 3.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:37,329 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:40,816 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:44,330 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:47,825 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3686, 'learning_rate': 0.000137514450867052, 'epoch': 3.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:51,403 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:54,876 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:19:58,294 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:20:01,760 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2973, 'learning_rate': 0.00013734104046242773, 'epoch': 3.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:20:05,286 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:20:08,649 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2733, 'learning_rate': 0.00013716763005780345, 'epoch': 3.23}
65%|████████████████████████████████████████████████▌ | 1443/2230 [4:47:53<2:41:46, 12.33s/it]
{'loss': 0.2181, 'learning_rate': 0.00013699421965317917, 'epoch': 3.24}
{'loss': 0.2085, 'learning_rate': 0.00013682080924855492, 'epoch': 3.24}
{'loss': 0.2227, 'learning_rate': 0.00013664739884393061, 'epoch': 3.24}
{'loss': 0.2176, 'learning_rate': 0.00013647398843930636, 'epoch': 3.24}
{'loss': 0.2107, 'learning_rate': 0.00013630057803468206, 'epoch': 3.24}
{'loss': 0.1943, 'learning_rate': 0.0001361271676300578, 'epoch': 3.25}
65%|████████████████████████████████████████████████▋ | 1449/2230 [4:49:12<2:50:36, 13.11s/it]
{'loss': 0.1552, 'learning_rate': 0.00013595375722543352, 'epoch': 3.25}
{'loss': 0.2, 'learning_rate': 0.00013578034682080925, 'epoch': 3.25}
{'loss': 0.2198, 'learning_rate': 0.00013560693641618497, 'epoch': 3.25}
65%|████████████████████████████████████████████████▊ | 1452/2230 [4:49:53<2:52:32, 13.31s/it]
{'loss': 0.1381, 'learning_rate': 0.0001354335260115607, 'epoch': 3.26}
{'loss': 0.2021, 'learning_rate': 0.0001352601156069364, 'epoch': 3.26}
{'loss': 0.19, 'learning_rate': 0.00013508670520231213, 'epoch': 3.26}
{'loss': 0.1655, 'learning_rate': 0.00013491329479768785, 'epoch': 3.26}
{'loss': 0.1811, 'learning_rate': 0.00013473988439306357, 'epoch': 3.26}
{'loss': 0.1644, 'learning_rate': 0.0001345664739884393, 'epoch': 3.27}
{'loss': 0.1778, 'learning_rate': 0.00013439306358381504, 'epoch': 3.27}
{'loss': 0.1469, 'learning_rate': 0.00013421965317919073, 'epoch': 3.27}
65%|█████████████████████████████████████████████████ | 1460/2230 [4:51:34<2:40:28, 12.50s/it]
{'loss': 0.1454, 'learning_rate': 0.00013404624277456648, 'epoch': 3.27}
66%|█████████████████████████████████████████████████▏ | 1461/2230 [4:51:46<2:39:02, 12.41s/it]
{'loss': 0.194, 'learning_rate': 0.00013387283236994217, 'epoch': 3.28}
{'loss': 0.1461, 'learning_rate': 0.00013369942196531792, 'epoch': 3.28}
{'loss': 0.1489, 'learning_rate': 0.00013352601156069364, 'epoch': 3.28}
{'loss': 0.1083, 'learning_rate': 0.00013335260115606936, 'epoch': 3.28}
{'loss': 0.1315, 'learning_rate': 0.00013317919075144508, 'epoch': 3.28}
66%|█████████████████████████████████████████████████▎ | 1466/2230 [4:52:47<2:34:46, 12.15s/it]
{'loss': 0.176, 'learning_rate': 0.00013300578034682078, 'epoch': 3.29}
[WARNING|modeling_utils.py:388] 2022-03-22 21:25:28,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3109, 'learning_rate': 0.00013300578034682078, 'epoch': 3.29}
{'loss': 0.1548, 'learning_rate': 0.00013283236994219652, 'epoch': 3.29}
{'loss': 0.2042, 'learning_rate': 0.00013265895953757224, 'epoch': 3.29}
{'loss': 0.1323, 'learning_rate': 0.00013248554913294797, 'epoch': 3.3} [WARNING|modeling_utils.py:388] 2022-03-22 21:25:28,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:25:28,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:25:28,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:25:28,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1398, 'learning_rate': 0.00013231213872832369, 'epoch': 3.3} 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▍ | 1471/2230 [4:53:44<2:25:03, 11.47s/it]g-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1565, 'learning_rate': 0.0001321387283236994, 'epoch': 3.3} [WARNING|modeling_utils.py:388] 2022-03-22 21:26:36,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:36,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:36,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:26:36,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:36,400 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:26:44,354 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 21:26:56,856 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1239, 'learning_rate': 0.00013161849710982657, 'epoch': 3.31} [WARNING|modeling_utils.py:388] 2022-03-22 21:27:10,639 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:10,639 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:10,639 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:10,639 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:16,910 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:16,910 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:16,910 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:22,917 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:25,273 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:25,273 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1207, 'learning_rate': 0.000131271676300578, 'epoch': 3.31} [WARNING|modeling_utils.py:388] 2022-03-22 21:27:25,273 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:31,213 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:31,213 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:31,213 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:10:47,352 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▋ | 1478/2230 [4:54:58<2:07:56, 10.21s/it][WARNING|modeling_bart.py:1051] 2022-03-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 66%|█████████████████████████████████████████████████▋ | 1478/2230 [4:54:58<2:07:56, 10.21s/it][WARNING|modeling_bart.py:1051] 2022-03-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:39,252 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:41,450 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:43,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:43,611 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:45,766 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:47,820 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:49,842 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:51,820 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:51,820 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:27:53,865 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:55,839 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:57,732 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:27:59,559 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:01,444 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:03,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:04,964 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:06,613 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:09,943 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:11,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:12,971 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:15,906 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:17,277 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:20,005 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:21,269 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:23,678 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:26,045 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:28,174 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:30,239 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:32,099 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:33,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:36,276 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:37,024 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:40,807 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:44,403 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:47,982 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:51,595 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:55,266 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:28:58,804 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:02,344 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:05,818 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:09,408 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:12,836 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:16,274 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:19,689 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:23,183 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.227, 'learning_rate': 0.00012867052023121387, 'epoch': 3.35}
{'loss': 0.1851, 'learning_rate': 0.00012849710982658957, 'epoch': 3.35}
{'loss': 0.1939, 'learning_rate': 0.00012832369942196532, 'epoch': 3.35}
[WARNING|modeling_utils.py:388] 2022-03-22 21:29:23,183 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 21:29:23,183 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2472, 'learning_rate': 0.00012815028901734104, 'epoch': 3.35} 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1952, 'learning_rate': 0.00012797687861271676, 'epoch': 3.35} 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|██████████████████████████████████████████████████▎ | 1495/2230 [4:57:37<2:38:59, 12.98s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1977, 'learning_rate': 0.00012780346820809248, 'epoch': 3.36} 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2138, 'learning_rate': 0.0001276300578034682, 'epoch': 3.36} 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▎ | 1497/2230 [4:58:04<2:40:31, 13.14s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2302, 'learning_rate': 0.00012745664739884392, 'epoch': 3.36} 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|██████████████████████████████████████████████████▍ | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1376, 'learning_rate': 0.00012728323699421964, 'epoch': 3.36} [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-22 21:31:22,127 >> Batch size = 8 | 1499/2230 [4:58:30<2:40:21, 13.16s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
03/22/2022 21:40:52 - INFO - datasets.metric - Removing /home/sanchit_huggingface_co/.cache/huggingface/metrics/wer/default/default_experiment-1-0.arrow
{'eval_loss': 0.33944663405418396, 'eval_wer': 0.10464101547005157, 'eval_runtime': 570.7817, 'eval_samples_per_second': 4.629, 'eval_steps_per_second': 0.58, 'epoch': 3.36}
{'loss': 0.1535, 'learning_rate': 0.00012710982658959536, 'epoch': 3.37}
{'loss': 0.159, 'learning_rate': 0.00012693641618497108, 'epoch': 3.37}
{'loss': 0.1576, 'learning_rate': 0.0001267630057803468, 'epoch': 3.37}
67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1656, 'learning_rate': 0.00012658959537572252, 'epoch': 3.37} 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1746, 'learning_rate': 0.00012641618497109824, 'epoch': 3.37} 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 67%|█████████████████████████████████████████████████▉ | 1504/2230 [5:10:44<16:30:02, 81.82s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1503, 'learning_rate': 0.000126242774566474, 'epoch': 3.38} 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.1322, 'learning_rate': 0.00012606936416184968, 'epoch': 3.38} 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.1634, 'learning_rate': 0.00012589595375722543, 'epoch': 3.38}
68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]
{'loss': 0.1521, 'learning_rate': 0.00012572254335260115, 'epoch': 3.38}
68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]
{'loss': 0.1227, 'learning_rate': 0.00012554913294797687, 'epoch': 3.39}
68%|██████████████████████████████████████████████████▋ | 1506/2230 [5:11:10<9:25:07, 46.83s/it]
68%|██████████████████████████████████████████████████▊ | 1511/2230 [5:12:16<3:45:00, 18.78s/it]
{'loss': 0.1433, 'learning_rate': 0.0001253757225433526, 'epoch': 3.39}
68%|██████████████████████████████████████████████████▊ | 1511/2230 [5:12:16<3:45:00, 18.78s/it]
68%|██████████████████████████████████████████████████▊ | 1512/2230 [5:12:28<3:21:39, 16.85s/it]
{'loss': 0.1242, 'learning_rate': 0.00012520231213872831, 'epoch': 3.39}
68%|██████████████████████████████████████████████████▊ | 1512/2230 [5:12:28<3:21:39, 16.85s/it]
{'loss': 0.1352, 'learning_rate': 0.00012502890173410404, 'epoch': 3.39}
68%|██████████████████████████████████████████████████▊ | 1512/2230 [5:12:28<3:21:39, 16.85s/it]
{'loss': 0.1243, 'learning_rate': 0.00012485549132947976, 'epoch': 3.39}
68%|██████████████████████████████████████████████████▊ | 1512/2230 [5:12:28<3:21:39, 16.85s/it]
{'loss': 0.1467, 'learning_rate': 0.00012468208092485548, 'epoch': 3.4}
68%|██████████████████████████████████████████████████▉ | 1516/2230 [5:13:18<2:37:14, 13.21s/it]
{'loss': 0.1349, 'learning_rate': 0.0001245086705202312, 'epoch': 3.4}
68%|██████████████████████████████████████████████████▉ | 1516/2230 [5:13:18<2:37:14, 13.21s/it]
{'loss': 0.1183, 'learning_rate': 0.00012433526011560692, 'epoch': 3.4}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:46:11,479 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:46:11,479 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1427, 'learning_rate': 0.00012416184971098267, 'epoch': 3.4}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:46:11,479 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
68%|███████████████████████████████████████████████████ | 1519/2230 [5:13:52<2:23:08, 12.08s/it]
{'loss': 0.1718, 'learning_rate': 0.0001238150289017341, 'epoch': 3.41}
[WARNING|modeling_utils.py:388] 2022-03-22 21:46:50,222 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1485, 'learning_rate': 0.0001236416184971098, 'epoch': 3.41}
[WARNING|modeling_utils.py:388] 2022-03-22 21:46:50,222 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:02,452 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1333, 'learning_rate': 0.00012346820809248555, 'epoch': 3.41}
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:02,452 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.149, 'learning_rate': 0.00012329479768786127, 'epoch': 3.41}
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:02,452 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1364, 'learning_rate': 0.000123121387283237, 'epoch': 3.42}
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:28,874 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:28,874 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1184, 'learning_rate': 0.0001229479768786127, 'epoch': 3.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:47:39,303 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:47,315 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:47,315 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:53,408 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:47:55,857 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1418, 'learning_rate': 0.00012260115606936415, 'epoch': 3.42}
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:01,839 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:01,839 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
69%|███████████████████████████████████████████████████▍ | 1528/2230 [5:15:28<2:00:41, 10.31s/it]
{'loss': 0.1314, 'learning_rate': 0.00012242774566473987, 'epoch': 3.43}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:48:09,616 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 21:48:11,845 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:15,627 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:17,768 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:19,876 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:21,951 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:24,052 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:26,013 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:27,890 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:29,783 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:31,724 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:33,518 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:35,280 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:37,015 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:40,451 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:42,129 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:43,694 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:45,313 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:48,247 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:48:49,682 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:52,485 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:53,712 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:56,079 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:48:58,275 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:49:00,368 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:02,271 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:04,176 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:05,034 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:08,116 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:49:10,455 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:14,053 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:17,718 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:49:21,197 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:24,833 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:28,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:31,885 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:49:35,333 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:38,865 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:42,280 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:49:45,711 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 21:49:49,145 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.2327, 'learning_rate': 0.00012017341040462427, 'epoch': 3.46} [WARNING|modeling_utils.py:388] 2022-03-22 21:49:52,678 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2735, 'learning_rate': 0.00011999999999999999, 'epoch': 3.46}
{'loss': 0.2421, 'learning_rate': 0.00011982658959537571, 'epoch': 3.46}
69%|███████████████████████████████████████████████████▉ | 1544/2230 [5:17:55<2:25:31, 12.73s/it] {'loss': 0.2077, 'learning_rate': 0.00011947976878612715, 'epoch': 3.46}
{'loss': 0.1836, 'learning_rate': 0.00011930635838150289, 'epoch': 3.47}
{'loss': 0.2034, 'learning_rate': 0.0001191329479768786, 'epoch': 3.47}
{'loss': 0.1808, 'learning_rate': 0.00011895953757225433, 'epoch': 3.47}
{'loss': 0.1652, 'learning_rate': 0.00011878612716763005, 'epoch': 3.47}
{'loss': 0.1875, 'learning_rate': 0.00011861271676300578, 'epoch': 3.48}
{'loss': 0.146, 'learning_rate': 0.00011843930635838149, 'epoch': 3.48} {'loss': 0.1162, 'learning_rate': 0.00011826589595375722, 'epoch': 3.48}
{'loss': 0.17, 'learning_rate': 0.00011809248554913293, 'epoch': 3.48}
{'loss': 0.1839, 'learning_rate': 0.00011791907514450866, 'epoch': 3.48}
{'loss': 0.1617, 'learning_rate': 0.00011774566473988439, 'epoch': 3.49}
{'loss': 0.1352, 'learning_rate': 0.00011757225433526012, 'epoch': 3.49}
{'loss': 0.1274, 'learning_rate': 0.00011739884393063583, 'epoch': 3.49}
{'loss': 0.1727, 'learning_rate': 0.00011722543352601156, 'epoch': 3.49}
{'loss': 0.1377, 'learning_rate': 0.00011705202312138727, 'epoch': 3.5}
{'loss': 0.1351, 'learning_rate': 0.000116878612716763, 'epoch': 3.5}
{'loss': 0.1326, 'learning_rate': 0.00011670520231213872, 'epoch': 3.5}
{'loss': 0.1246, 'learning_rate': 0.00011653179190751443, 'epoch': 3.5}
{'loss': 0.1424, 'learning_rate': 0.00011635838150289016, 'epoch': 3.5}
{'loss': 0.1224, 'learning_rate': 0.00011618497109826587, 'epoch': 3.51}
[WARNING|modeling_utils.py:388] 2022-03-22 21:55:01,025 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1188, 'learning_rate': 0.0001160115606936416, 'epoch': 3.51}
 70%|████████████████████████████████████████████████████▋ | 1566/2230 [5:22:36<2:14:20, 12.14s/it]
{'loss': 0.1331, 'learning_rate': 0.00011583815028901733, 'epoch': 3.51}
{'loss': 0.1422, 'learning_rate': 0.00011566473988439306, 'epoch': 3.51}
 70%|████████████████████████████████████████████████████▋ | 1568/2230 [5:22:59<2:10:10, 11.80s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 21:55:44,144 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1677, 'learning_rate': 0.0001153179190751445, 'epoch': 3.52}
{'loss': 0.1393, 'learning_rate': 0.00011514450867052021, 'epoch': 3.52}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:56:00,420 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 21:56:06,283 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:56:10,344 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.154, 'learning_rate': 0.00011479768786127166, 'epoch': 3.52}
[WARNING|modeling_bart.py:1051] 2022-03-22 21:56:24,457 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 71%|████████████████████████████████████████████████████▉ | 1573/2230 [5:23:53<1:59:48, 10.94s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 21:56:36,691 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
71%|████████████████████████████████████████████████████▉ | 1574/2230 [5:24:03<1:57:36, 10.76s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 21:56:45,152 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1106, 'learning_rate': 0.00011427745664739884, 'epoch': 3.53}
[WARNING|modeling_utils.py:388] 2022-03-22 21:56:54,979 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:01,010 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1191, 'learning_rate': 0.00011410404624277455, 'epoch': 3.53}
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:07,113 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:57:11,363 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1119, 'learning_rate': 0.00011393063583815028, 'epoch': 3.54}
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:15,348 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 21:57:19,451 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
71%|█████████████████████████████████████████████████████ | 1578/2230 [5:24:44<1:48:50, 10.02s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:23,272 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:25,389 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:27,488 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:29,553 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:31,690 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:33,755 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:35,727 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:37,689 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:39,628 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:41,518 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:43,383 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:45,218 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:47,132 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:48,864 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:50,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:53,955 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:55,525 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:57,051 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:57:59,966 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:01,306 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:03,932 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:05,287 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:07,679 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:10,035 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:12,150 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:14,176 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:16,036 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:17,931 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:20,217 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:21,750 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1495, 'learning_rate': 0.00011202312138728322, 'epoch': 3.56}
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:25,542 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:29,146 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:32,692 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:36,250 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.318, 'learning_rate': 0.00011184971098265896, 'epoch': 3.56}
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:39,845 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:43,405 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:46,874 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:50,355 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.249, 'learning_rate': 0.00011167630057803466, 'epoch': 3.57}
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:53,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:58:57,464 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:59:00,903 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 21:59:04,388 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2753, 'learning_rate': 0.0001115028901734104, 'epoch': 3.57}
[WARNING|modeling_utils.py:388] 2022-03-22 21:59:07,941 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1983, 'learning_rate': 0.00011132947976878612, 'epoch': 3.57}
71%|█████████████████████████████████████████████████████▌ | 1593/2230 [5:26:55<2:11:28, 12.38s/it]
{'loss': 0.2503, 'learning_rate': 0.00011115606936416184, 'epoch': 3.57}
{'loss': 0.2084, 'learning_rate': 0.00011098265895953756, 'epoch': 3.57}
{'loss': 0.2058, 'learning_rate': 0.0001108092485549133, 'epoch': 3.58}
72%|█████████████████████████████████████████████████████▋ | 1596/2230 [5:27:35<2:18:41, 13.13s/it]
{'loss': 0.1842, 'learning_rate': 0.000110635838150289, 'epoch': 3.58}
{'loss': 0.1766, 'learning_rate': 0.00011046242774566474, 'epoch': 3.58}
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.175, 'learning_rate': 0.00011011560693641618, 'epoch': 3.59} 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1449, 'learning_rate': 0.0001099421965317919, 'epoch': 3.59} 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1336, 'learning_rate': 0.00010976878612716762, 'epoch': 3.59} 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1672, 'learning_rate': 0.00010959537572254334, 'epoch': 3.59} 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1915, 'learning_rate': 0.00010942196531791907, 'epoch': 3.59} 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▋ | 1598/2230 [5:28:02<2:18:41, 13.17s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1526, 'learning_rate': 0.00010924855491329478, 'epoch': 3.6} 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 72%|█████████████████████████████████████████████████████▉ | 1604/2230 [5:29:21<2:16:20, 13.07s/it]g-point operations will not be computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
72%|█████████████████████████████████████████████████████▉ | 1605/2230 [5:29:34<2:15:06, 12.97s/it]
{'loss': 0.1468, 'learning_rate': 0.00010907514450867051, 'epoch': 3.6}
{'loss': 0.1434, 'learning_rate': 0.00010890173410404623, 'epoch': 3.6}
{'loss': 0.1468, 'learning_rate': 0.00010872832369942196, 'epoch': 3.6}
{'loss': 0.126, 'learning_rate': 0.00010855491329479768, 'epoch': 3.61}
{'loss': 0.1342, 'learning_rate': 0.00010838150289017341, 'epoch': 3.61}
{'loss': 0.1391, 'learning_rate': 0.00010820809248554912, 'epoch': 3.61}
[WARNING|modeling_utils.py:388] 2022-03-22 22:03:20,931 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1658, 'learning_rate': 0.00010803468208092485, 'epoch': 3.61}
72%|██████████████████████████████████████████████████████▏ | 1612/2230 [5:31:00<2:06:20, 12.27s/it]
{'loss': 0.1385, 'learning_rate': 0.00010786127167630056, 'epoch': 3.61}
{'loss': 0.156, 'learning_rate': 0.00010768786127167629, 'epoch': 3.62}
{'loss': 0.1532, 'learning_rate': 0.00010751445086705201, 'epoch': 3.62}
{'loss': 0.1125, 'learning_rate': 0.00010734104046242773, 'epoch': 3.62}
{'loss': 0.1232, 'learning_rate': 0.00010716763005780346, 'epoch': 3.62}
{'loss': 0.1338, 'learning_rate': 0.00010699421965317919, 'epoch': 3.63}
{'loss': 0.1572, 'learning_rate': 0.0001068208092485549, 'epoch': 3.63}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:04:53,074 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
73%|██████████████████████████████████████████████████████▍ | 1619/2230 [5:32:24<1:58:17, 11.62s/it]
{'loss': 0.1134, 'learning_rate': 0.00010664739884393063, 'epoch': 3.63}
[WARNING|modeling_utils.py:388] 2022-03-22 22:05:04,916 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1702, 'learning_rate': 0.00010647398843930635, 'epoch': 3.63}
73%|██████████████████████████████████████████████████████▌ | 1621/2230 [5:32:46<1:55:15, 11.36s/it]
{'loss': 0.1197, 'learning_rate': 0.00010630057803468207, 'epoch': 3.63}
[WARNING|modeling_utils.py:388] 2022-03-22 22:05:29,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
73%|██████████████████████████████████████████████████████▌ | 1622/2230 [5:32:57<1:52:57, 11.15s/it]
{'loss': 0.1265, 'learning_rate': 0.00010612716763005779, 'epoch': 3.64}
{'loss': 0.1377, 'learning_rate': 0.00010595375722543353, 'epoch': 3.64}
[WARNING|modeling_utils.py:388] 2022-03-22 22:05:51,949 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1339, 'learning_rate': 0.00010578034682080923, 'epoch': 3.64}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:00,407 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1696, 'learning_rate': 0.00010560693641618497, 'epoch': 3.64}
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:12,610 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
73%|██████████████████████████████████████████████████████▋ | 1626/2230 [5:33:39<1:48:08, 10.74s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:18,685 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:22,198 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:24,473 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1684, 'learning_rate': 0.00010526011560693641, 'epoch': 3.65}
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:30,260 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:32,494 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:06:34,697 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1417, 'learning_rate': 0.00010508670520231213, 'epoch': 3.65}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:38,737 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:40,824 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:42,891 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:45,061 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:47,118 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:49,107 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:51,076 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:53,051 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:54,910 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:56,741 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:06:58,514 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:00,375 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:02,087 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:05,386 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:07,093 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:08,623 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:11,667 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:13,198 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:14,595 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:17,263 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:19,843 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:21,031 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:23,405 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:25,557 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:27,647 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:29,552 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:31,452 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:32,289 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:35,314 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2108, 'learning_rate': 0.00010335260115606935, 'epoch': 3.67}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:38,628 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:42,249 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:45,774 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:49,346 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3022, 'learning_rate': 0.00010317919075144509, 'epoch': 3.67}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:52,969 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:56,454 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:07:59,984 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:07:59,984 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:07:59,984 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:03,423 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:03,423 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:06,990 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:10,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 22:08:10,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:13,932 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:13,932 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:13,932 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:17,419 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:17,419 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 22:08:20,921 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 22:08:24,341 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2243, 'learning_rate': 0.00010265895953757225, 'epoch': 3.68}
{'loss': 0.1595, 'learning_rate': 0.00010248554913294798, 'epoch': 3.68}
{'loss': 0.229, 'learning_rate': 0.00010231213872832369, 'epoch': 3.69}
{'loss': 0.1601, 'learning_rate': 0.00010213872832369942, 'epoch': 3.69}
{'loss': 0.176, 'learning_rate': 0.00010196531791907513, 'epoch': 3.69}
{'loss': 0.1696, 'learning_rate': 0.00010179190751445086, 'epoch': 3.69}
74%|███████████████████████████████████████████████████████▍ | 1648/2230 [5:37:15<2:08:00, 13.20s/it]
{'loss': 0.2211, 'learning_rate': 0.00010161849710982658, 'epoch': 3.7}
{'loss': 0.2021, 'learning_rate': 0.0001014450867052023, 'epoch': 3.7}
{'loss': 0.1745, 'learning_rate': 0.00010127167630057803, 'epoch': 3.7}
{'loss': 0.1634, 'learning_rate': 0.00010109826589595376, 'epoch': 3.7}
{'loss': 0.1452, 'learning_rate': 0.00010092485549132947, 'epoch': 3.7}
{'loss': 0.1467, 'learning_rate': 0.0001007514450867052, 'epoch': 3.71}
{'loss': 0.1406, 'learning_rate': 0.00010057803468208092, 'epoch': 3.71}
{'loss': 0.1217, 'learning_rate': 0.00010040462427745664, 'epoch': 3.71}
74%|███████████████████████████████████████████████████████▍ | 1648/2230 [5:37:15<2:08:00, 13.20s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▍ | 1648/2230 [5:37:15<2:08:00, 13.20s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▍ | 1648/2230 [5:37:15<2:08:00, 13.20s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1009, 'learning_rate': 0.00010023121387283236, 'epoch': 3.71} 74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1656/2230 [5:39:00<2:02:50, 12.84s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1508, 'learning_rate': 0.0001000578034682081, 'epoch': 3.72} 74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
74%|███████████████████████████████████████████████████████▋ | 1657/2230 [5:39:13<2:01:38, 12.74s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.138, 'learning_rate': 9.98843930635838e-05, 'epoch': 3.72} 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 74%|███████████████████████████████████████████████████████▊ | 1658/2230 [5:39:25<2:00:37, 12.65s/it] Setting `use_cache=False`...e computed-22 21:27:35,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
74%|███████████████████████████████████████████████████████▊ | 1659/2230 [5:39:37<1:59:25, 12.55s/it]
{'loss': 0.1139, 'learning_rate': 9.971098265895953e-05, 'epoch': 3.72}
74%|███████████████████████████████████████████████████████▊ | 1660/2230 [5:39:50<1:58:19, 12.46s/it]
{'loss': 0.1276, 'learning_rate': 9.953757225433525e-05, 'epoch': 3.72}
74%|███████████████████████████████████████████████████████▊ | 1661/2230 [5:40:02<1:57:24, 12.38s/it]
{'loss': 0.0977, 'learning_rate': 9.936416184971097e-05, 'epoch': 3.72}
75%|███████████████████████████████████████████████████████▉ | 1662/2230 [5:40:14<1:56:15, 12.28s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:13:01,659 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.096, 'learning_rate': 9.901734104046241e-05, 'epoch': 3.73}
{'loss': 0.135, 'learning_rate': 9.884393063583814e-05, 'epoch': 3.73}
{'loss': 0.1351, 'learning_rate': 9.867052023121385e-05, 'epoch': 3.73}
75%|████████████████████████████████████████████████████████ | 1666/2230 [5:41:03<1:53:25, 12.07s/it]
{'loss': 0.1614, 'learning_rate': 9.83236994219653e-05, 'epoch': 3.74}
{'loss': 0.1256, 'learning_rate': 9.815028901734104e-05, 'epoch': 3.74}
75%|████████████████████████████████████████████████████████▏ | 1669/2230 [5:41:37<1:49:18, 11.69s/it]
{'loss': 0.128, 'learning_rate': 9.780346820809248e-05, 'epoch': 3.74}
75%|████████████████████████████████████████████████████████▏ | 1671/2230 [5:42:00<1:46:40, 11.45s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0963, 'learning_rate': 9.763005780346819e-05, 'epoch': 3.75}
[WARNING|modeling_utils.py:388] 2022-03-22 22:14:47,850 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1002, 'learning_rate': 9.745664739884392e-05, 'epoch': 3.75}
{'loss': 0.1219, 'learning_rate': 9.728323699421964e-05, 'epoch': 3.75}
[WARNING|modeling_utils.py:388] 2022-03-22 22:15:10,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1512, 'learning_rate': 9.693641618497108e-05, 'epoch': 3.76}
{'loss': 0.1085, 'learning_rate': 9.676300578034682e-05, 'epoch': 3.76}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:15:34,421 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:15:38,478 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 22:15:42,813 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:15:46,715 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:15:48,980 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 22:15:53,095 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:15:55,253 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:15:58,842 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:00,974 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:03,015 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:04,955 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:06,912 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:06,912 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:08,887 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:10,775 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:16:12,606 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:14,424 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:16,357 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:18,105 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:19,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:23,183 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:24,748 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:26,274 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:29,277 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:30,637 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:33,230 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:34,562 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:36,944 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:39,283 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:41,356 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:43,392 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:45,184 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:47,025 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:49,246 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:50,801 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2098, 'learning_rate': 9.468208092485548e-05, 'epoch': 3.78}
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:54,778 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:16:58,379 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:01,973 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:05,500 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3423, 'learning_rate': 9.45086705202312e-05, 'epoch': 3.79}
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:09,100 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:12,633 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:16,083 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:19,572 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2938, 'learning_rate': 9.433526011560693e-05, 'epoch': 3.79}
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:23,095 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:26,556 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:30,046 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:33,374 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.2029, 'learning_rate': 9.416184971098264e-05, 'epoch': 3.79}
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:36,899 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.211, 'learning_rate': 9.398843930635838e-05, 'epoch': 3.79} [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:17:40,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1836, 'learning_rate': 9.38150289017341e-05, 'epoch': 3.8} 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1827, 'learning_rate': 9.364161849710982e-05, 'epoch': 3.8} 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.2368, 'learning_rate': 9.346820809248554e-05, 'epoch': 3.8} 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1681, 'learning_rate': 9.329479768786127e-05, 'epoch': 3.8} 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|████████████████████████████████████████████████████████▉ | 1693/2230 [5:45:24<1:50:42, 12.37s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1423, 'learning_rate': 9.312138728323698e-05, 'epoch': 3.8} 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1687, 'learning_rate': 9.294797687861271e-05, 'epoch': 3.81} 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1647, 'learning_rate': 9.277456647398842e-05, 'epoch': 3.81} 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1315, 'learning_rate': 9.260115606936415e-05, 'epoch': 3.81} 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 76%|█████████████████████████████████████████████████████████ | 1697/2230 [5:46:17<1:56:06, 13.07s/it]g-point operations will not be computed-22 22:14:37,851 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.1734, 'learning_rate': 9.242774566473988e-05, 'epoch': 3.81}
{'loss': 0.1152, 'learning_rate': 9.22543352601156e-05, 'epoch': 3.82}
{'loss': 0.1555, 'learning_rate': 9.208092485549132e-05, 'epoch': 3.82}
{'loss': 0.102, 'learning_rate': 9.190751445086705e-05, 'epoch': 3.82}
76%|█████████████████████████████████████████████████████████▎ | 1705/2230 [5:48:02<1:52:43, 12.88s/it]
{'loss': 0.1814, 'learning_rate': 9.173410404624276e-05, 'epoch': 3.82}
{'loss': 0.1371, 'learning_rate': 9.156069364161849e-05, 'epoch': 3.83}
{'loss': 0.136, 'learning_rate': 9.138728323699421e-05, 'epoch': 3.83}
{'loss': 0.1517, 'learning_rate': 9.121387283236993e-05, 'epoch': 3.83}
{'loss': 0.1427, 'learning_rate': 9.104046242774565e-05, 'epoch': 3.83}
{'loss': 0.1287, 'learning_rate': 9.086705202312139e-05, 'epoch': 3.83}
{'loss': 0.1701, 'learning_rate': 9.06936416184971e-05, 'epoch': 3.84}
{'loss': 0.1062, 'learning_rate': 9.052023121387283e-05, 'epoch': 3.84}
{'loss': 0.1352, 'learning_rate': 9.034682080924854e-05, 'epoch': 3.84}
[WARNING|modeling_utils.py:388] 2022-03-22 22:22:29,631 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1358, 'learning_rate': 9.017341040462427e-05, 'epoch': 3.84}
{'loss': 0.1026, 'learning_rate': 8.999999999999999e-05, 'epoch': 3.85}
{'loss': 0.1243, 'learning_rate': 8.982658959537573e-05, 'epoch': 3.85}
77%|█████████████████████████████████████████████████████████▋ | 1717/2230 [5:50:29<1:41:56, 11.92s/it]
{'loss': 0.0982, 'learning_rate': 8.965317919075143e-05, 'epoch': 3.85}
[WARNING|modeling_utils.py:388] 2022-03-22 22:23:14,755 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1623, 'learning_rate': 8.947976878612717e-05, 'epoch': 3.85}
77%|█████████████████████████████████████████████████████████▊ | 1719/2230 [5:50:52<1:38:52, 11.61s/it]
{'loss': 0.1325, 'learning_rate': 8.930635838150287e-05, 'epoch': 3.85}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:23:35,204 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:23:43,564 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1091, 'learning_rate': 8.895953757225433e-05, 'epoch': 3.86}
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:01,535 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:12,114 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1028, 'learning_rate': 8.861271676300577e-05, 'epoch': 3.86}
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:16,122 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0774, 'learning_rate': 8.84393063583815e-05, 'epoch': 3.87}
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:26,483 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1442, 'learning_rate': 8.826589595375721e-05, 'epoch': 3.87}
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:44,256 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 22:24:48,683 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:52,675 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1029, 'learning_rate': 8.791907514450865e-05, 'epoch': 3.87}
[WARNING|modeling_utils.py:388] 2022-03-22 22:24:58,573 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:25:00,758 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:25:02,925 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1056, 'learning_rate': 8.774566473988439e-05, 'epoch': 3.87}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:06,872 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:09,017 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:11,139 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:13,225 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:15,207 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:17,137 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:19,004 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:20,951 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:22,802 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:24,608 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:26,413 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:28,206 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:31,607 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:33,251 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:34,915 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:37,911 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:39,306 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:40,766 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:43,448 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:46,026 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:47,204 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:49,495 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:51,678 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:53,613 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:55,616 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:57,385 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:25:59,088 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:01,982 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1612, 'learning_rate': 8.601156069364162e-05, 'epoch': 3.9}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:05,314 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:09,005 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:12,557 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:16,037 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.3042, 'learning_rate': 8.583815028901733e-05, 'epoch': 3.9}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:19,595 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:23,102 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:26,585 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:30,007 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2292, 'learning_rate': 8.566473988439306e-05, 'epoch': 3.9}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:33,524 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:36,985 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:40,370 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:43,764 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.208, 'learning_rate': 8.549132947976878e-05, 'epoch': 3.9}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:47,217 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:50,567 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:53,961 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:26:57,318 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2278, 'learning_rate': 8.53179190751445e-05, 'epoch': 3.91}
 78%|██████████████████████████████████████████████████████████▌ | 1743/2230 [5:54:34<1:39:12, 12.22s/it]
{'loss': 0.1668, 'learning_rate': 8.497109826589596e-05, 'epoch': 3.91}
78%|██████████████████████████████████████████████████████████▋ | 1745/2230 [5:55:01<1:43:45, 12.84s/it]
{'loss': 0.1466, 'learning_rate': 8.479768786127167e-05, 'epoch': 3.91}
{'loss': 0.15, 'learning_rate': 8.46242774566474e-05, 'epoch': 3.91}
{'loss': 0.1258, 'learning_rate': 8.445086705202311e-05, 'epoch': 3.92}
{'loss': 0.1983, 'learning_rate': 8.427745664739884e-05, 'epoch': 3.92}
78%|██████████████████████████████████████████████████████████▊ | 1749/2230 [5:55:54<1:44:37, 13.05s/it]
{'loss': 0.1321, 'learning_rate': 8.410404624277456e-05, 'epoch': 3.92}
{'loss': 0.1364, 'learning_rate': 8.393063583815028e-05, 'epoch': 3.92}
{'loss': 0.161, 'learning_rate': 8.3757225433526e-05, 'epoch': 3.93}
{'loss': 0.1493, 'learning_rate': 8.358381502890174e-05, 'epoch': 3.93}
{'loss': 0.1414, 'learning_rate': 8.341040462427745e-05, 'epoch': 3.93}
{'loss': 0.1325, 'learning_rate': 8.323699421965317e-05, 'epoch': 3.93}
{'loss': 0.1344, 'learning_rate': 8.30635838150289e-05, 'epoch': 3.93}
{'loss': 0.1742, 'learning_rate': 8.289017341040461e-05, 'epoch': 3.94}
{'loss': 0.1388, 'learning_rate': 8.271676300578034e-05, 'epoch': 3.94}
{'loss': 0.101, 'learning_rate': 8.254335260115605e-05, 'epoch': 3.94}
{'loss': 0.1336, 'learning_rate': 8.236994219653178e-05, 'epoch': 3.94}
79%|███████████████████████████████████████████████████████████▏ | 1760/2230 [5:58:14<1:36:23, 12.31s/it]
{'loss': 0.1235, 'learning_rate': 8.21965317919075e-05, 'epoch': 3.95}
79%|███████████████████████████████████████████████████████████▏ | 1761/2230 [5:58:26<1:35:52, 12.27s/it]
{'loss': 0.1327, 'learning_rate': 8.202312138728322e-05, 'epoch': 3.95}
{'loss': 0.1269, 'learning_rate': 8.184971098265895e-05, 'epoch': 3.95}
{'loss': 0.0936, 'learning_rate': 8.167630057803468e-05, 'epoch': 3.95}
79%|███████████████████████████████████████████████████████████▎ | 1764/2230 [5:59:03<1:35:49, 12.34s/it]
{'loss': 0.1567, 'learning_rate': 8.150289017341039e-05, 'epoch': 3.96}
{'loss': 0.1126, 'learning_rate': 8.132947976878612e-05, 'epoch': 3.96}
{'loss': 0.1776, 'learning_rate': 8.115606936416184e-05, 'epoch': 3.96}
[WARNING|modeling_utils.py:388] 2022-03-22 22:32:08,870 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
79%|███████████████████████████████████████████████████████████▍ | 1767/2230 [5:59:38<1:30:31, 11.73s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 22:32:15,482 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1237, 'learning_rate': 8.098265895953756e-05, 'epoch': 3.96}
{'loss': 0.1129, 'learning_rate': 8.080924855491328e-05, 'epoch': 3.96}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:32:36,249 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1201, 'learning_rate': 8.063583815028902e-05, 'epoch': 3.97}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:32:44,306 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
79%|███████████████████████████████████████████████████████████▌ | 1770/2230 [6:00:10<1:25:41, 11.18s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:32:50,116 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:32:54,069 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
79%|███████████████████████████████████████████████████████████▌ | 1771/2230 [6:00:21<1:24:05, 10.99s/it]
{'loss': 0.1233, 'learning_rate': 8.028901734104046e-05, 'epoch': 3.97}
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:04,349 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
79%|███████████████████████████████████████████████████████████▌ | 1772/2230 [6:00:31<1:22:01, 10.75s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:10,669 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:16,737 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1067, 'learning_rate': 7.99421965317919e-05, 'epoch': 3.98}
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:22,906 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:25,238 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1084, 'learning_rate': 7.976878612716762e-05, 'epoch': 3.98}
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:30,993 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:33,202 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:35,352 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1326, 'learning_rate': 7.959537572254334e-05, 'epoch': 3.98}
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:41,372 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:43,420 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:45,433 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:47,463 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:49,370 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:51,287 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:53,099 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:54,993 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:56,756 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:33:58,462 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:00,156 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:03,390 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:04,940 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:06,431 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:09,331 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:10,702 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:13,414 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:15,917 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:17,091 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:19,373 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:21,419 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:23,342 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:26,083 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:27,720 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:28,464 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:30,943 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:34,606 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:38,232 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:41,842 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:45,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:49,019 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:52,540 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:56,055 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:34:59,586 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:03,028 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:06,529 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:09,984 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1435, 'learning_rate': 7.751445086705202e-05, 'epoch': 4.01}
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:13,587 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:17,027 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:35:22,492 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1179, 'learning_rate': 7.734104046242774e-05, 'epoch': 4.01}
{'loss': 0.1074, 'learning_rate': 7.716763005780346e-05, 'epoch': 4.01}
{'loss': 0.1491, 'learning_rate': 7.699421965317918e-05, 'epoch': 4.01}
{'loss': 0.1194, 'learning_rate': 7.682080924855491e-05, 'epoch': 4.02} {'loss': 0.108, 'learning_rate': 7.664739884393062e-05, 'epoch': 4.02}
{'loss': 0.1506, 'learning_rate': 7.647398843930635e-05, 'epoch': 4.02}
{'loss': 0.1111, 'learning_rate': 7.630057803468207e-05, 'epoch': 4.02} {'loss': 0.079, 'learning_rate': 7.61271676300578e-05, 'epoch': 4.02}
{'loss': 0.1138, 'learning_rate': 7.595375722543352e-05, 'epoch': 4.03} [WARNING|modeling_utils.py:388] 2022-03-22 22:37:17,351 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 81%|████████████████████████████████████████████████████████████▍ | 1797/2230 [6:04:50<1:34:07, 13.04s/it]
{'loss': 0.0851, 'learning_rate': 7.578034682080925e-05, 'epoch': 4.03} [WARNING|modeling_utils.py:388] 2022-03-22 22:37:31,870 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0961, 'learning_rate': 7.560693641618496e-05, 'epoch': 4.03} [WARNING|modeling_utils.py:388] 2022-03-22 22:37:46,313 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0884, 'learning_rate': 7.543352601156069e-05, 'epoch': 4.03}
{'loss': 0.1127, 'learning_rate': 7.52601156069364e-05, 'epoch': 4.04}
{'loss': 0.0897, 'learning_rate': 7.508670520231213e-05, 'epoch': 4.04}
{'loss': 0.0773, 'learning_rate': 7.491329479768785e-05, 'epoch': 4.04} {'loss': 0.0915, 'learning_rate': 7.473988439306357e-05, 'epoch': 4.04}
{'loss': 0.068, 'learning_rate': 7.45664739884393e-05, 'epoch': 4.04}
{'loss': 0.1086, 'learning_rate': 7.439306358381502e-05, 'epoch': 4.05}
81%|████████████████████████████████████████████████████████████▋ | 1806/2230 [6:06:45<1:28:31, 12.53s/it]
81%|████████████████████████████████████████████████████████████▊ | 1807/2230 [6:06:57<1:27:34, 12.42s/it] {'loss': 0.0761, 'learning_rate': 7.404624277456646e-05, 'epoch': 4.05}
{'loss': 0.0756, 'learning_rate': 7.387283236994219e-05, 'epoch': 4.05}
81%|████████████████████████████████████████████████████████████▊ | 1807/2230 [6:06:57<1:27:34, 12.42s/it]
{'loss': 0.0843, 'learning_rate': 7.369942196531791e-05, 'epoch': 4.06}
{'loss': 0.0606, 'learning_rate': 7.352601156069363e-05, 'epoch': 4.06}
81%|████████████████████████████████████████████████████████████▊ | 1807/2230 [6:06:57<1:27:34, 12.42s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:21,674 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0804, 'learning_rate': 7.335260115606935e-05, 'epoch': 4.06}
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:21,674 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
81%|████████████████████████████████████████████████████████████▉ | 1812/2230 [6:07:56<1:23:02, 11.92s/it]
{'loss': 0.0911, 'learning_rate': 7.317919075144507e-05, 'epoch': 4.06}
81%|████████████████████████████████████████████████████████████▉ | 1812/2230 [6:07:56<1:23:02, 11.92s/it]
{'loss': 0.0853, 'learning_rate': 7.30057803468208e-05, 'epoch': 4.07}
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:54,156 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:54,156 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:58,380 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0711, 'learning_rate': 7.283236994219653e-05, 'epoch': 4.07}
[WARNING|modeling_utils.py:388] 2022-03-22 22:40:58,380 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0891, 'learning_rate': 7.265895953757225e-05, 'epoch': 4.07}
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:20,765 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0727, 'learning_rate': 7.248554913294797e-05, 'epoch': 4.07}
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:20,765 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 22:41:35,409 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:41:35,409 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:41,162 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:45,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:45,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
82%|█████████████████████████████████████████████████████████████▏ | 1819/2230 [6:09:16<1:15:11, 10.98s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:55,573 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:41:55,573 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
82%|█████████████████████████████████████████████████████████████▏ | 1820/2230 [6:09:26<1:13:38, 10.78s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:05,793 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:05,793 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
82%|█████████████████████████████████████████████████████████████▏ | 1821/2230 [6:09:36<1:12:12, 10.59s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:42:20,350 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0596, 'learning_rate': 7.144508670520231e-05, 'epoch': 4.09}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:42:26,364 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:30,324 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:32,616 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0784, 'learning_rate': 7.127167630057803e-05, 'epoch': 4.09}
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:38,332 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:40,480 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:42,703 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:44,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:46,898 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:49,000 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:49,000 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:54,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:56,853 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:42:58,763 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:00,702 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:02,522 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:04,316 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:06,087 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:07,893 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:09,581 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:12,852 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:14,560 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:16,111 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:19,064 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:20,530 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:23,258 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:24,565 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:27,189 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:28,384 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:30,757 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:32,862 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:34,908 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:34,908 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:36,717 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:36,717 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:39,363 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:40,884 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:40,884 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:43,141 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:43,141 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:46,800 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:46,800 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:50,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:50,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:43:53,929 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:43:53,929 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1786, 'learning_rate': 6.91907514450867e-05, 'epoch': 4.11} [WARNING|modeling_utils.py:388] 2022-03-22 22:43:57,551 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:01,091 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:01,091 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:04,523 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:04,523 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:44:08,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:08,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:11,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:11,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:15,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:15,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:18,551 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:44:22,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:22,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:22,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:25,544 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:25,544 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:28,913 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 22:44:34,340 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 22:44:37,739 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1287, 'learning_rate': 6.867052023121387e-05, 'epoch': 4.12}
{'loss': 0.1288, 'learning_rate': 6.849710982658959e-05, 'epoch': 4.12}
{'loss': 0.1246, 'learning_rate': 6.832369942196531e-05, 'epoch': 4.13}
{'loss': 0.1106, 'learning_rate': 6.815028901734103e-05, 'epoch': 4.13}
83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]
83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0932, 'learning_rate': 6.780346820809248e-05, 'epoch': 4.13} 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|█████████████████████████████████████████████████████████████▉ | 1842/2230 [6:12:56<1:24:50, 13.12s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0964, 'learning_rate': 6.76300578034682e-05, 'epoch': 4.13} 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0843, 'learning_rate': 6.745664739884392e-05, 'epoch': 4.14} 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0865, 'learning_rate': 6.728323699421964e-05, 'epoch': 4.14} 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0682, 'learning_rate': 6.710982658959537e-05, 'epoch': 4.14} 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0936, 'learning_rate': 6.693641618497109e-05, 'epoch': 4.14} 83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]g-point operations will not be computed-22 22:42:14,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
83%|██████████████████████████████████████████████████████████████ | 1844/2230 [6:13:23<1:25:17, 13.26s/it]
{'loss': 0.0974, 'learning_rate': 6.676300578034682e-05, 'epoch': 4.15}
{'loss': 0.077, 'learning_rate': 6.658959537572254e-05, 'epoch': 4.15}
{'loss': 0.0933, 'learning_rate': 6.641618497109826e-05, 'epoch': 4.15}
83%|██████████████████████████████████████████████████████████████▎ | 1852/2230 [6:15:07<1:21:41, 12.97s/it]
{'loss': 0.0889, 'learning_rate': 6.624277456647398e-05, 'epoch': 4.15}
83%|██████████████████████████████████████████████████████████████▎ | 1853/2230 [6:15:20<1:20:23, 12.79s/it]
83%|██████████████████████████████████████████████████████████████▎ | 1854/2230 [6:15:32<1:19:28, 12.68s/it]
{'loss': 0.1003, 'learning_rate': 6.589595375722542e-05, 'epoch': 4.16}
83%|██████████████████████████████████████████████████████████████▍ | 1855/2230 [6:15:45<1:18:44, 12.60s/it]
{'loss': 0.1036, 'learning_rate': 6.572254335260114e-05, 'epoch': 4.16}
83%|██████████████████████████████████████████████████████████████▍ | 1856/2230 [6:15:57<1:17:47, 12.48s/it]
{'loss': 0.0973, 'learning_rate': 6.554913294797688e-05, 'epoch': 4.16}
83%|██████████████████████████████████████████████████████████████▍ | 1857/2230 [6:16:09<1:16:57, 12.38s/it]
{'loss': 0.0868, 'learning_rate': 6.53757225433526e-05, 'epoch': 4.16}
83%|██████████████████████████████████████████████████████████████▍ | 1858/2230 [6:16:21<1:16:01, 12.26s/it]
{'loss': 0.0673, 'learning_rate': 6.520231213872832e-05, 'epoch': 4.17}
{'loss': 0.0833, 'learning_rate': 6.502890173410404e-05, 'epoch': 4.17}
{'loss': 0.0597, 'learning_rate': 6.485549132947976e-05, 'epoch': 4.17}
83%|██████████████████████████████████████████████████████████████▌ | 1861/2230 [6:16:56<1:13:37, 11.97s/it]
{'loss': 0.0666, 'learning_rate': 6.468208092485548e-05, 'epoch': 4.17}
83%|██████████████████████████████████████████████████████████████▌ | 1862/2230 [6:17:08<1:12:42, 11.85s/it]
{'loss': 0.0615, 'learning_rate': 6.45086705202312e-05, 'epoch': 4.17}
{'loss': 0.0832, 'learning_rate': 6.433526011560694e-05, 'epoch': 4.18}
84%|██████████████████████████████████████████████████████████████▋ | 1864/2230 [6:17:33<1:13:29, 12.05s/it]
{'loss': 0.0922, 'learning_rate': 6.416184971098266e-05, 'epoch': 4.18}
[WARNING|modeling_utils.py:388] 2022-03-22 22:50:18,269 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0847, 'learning_rate': 6.398843930635838e-05, 'epoch': 4.18}
84%|██████████████████████████████████████████████████████████████▊ | 1866/2230 [6:17:55<1:10:35, 11.63s/it]
{'loss': 0.067, 'learning_rate': 6.38150289017341e-05, 'epoch': 4.18}
{'loss': 0.0962, 'learning_rate': 6.364161849710982e-05, 'epoch': 4.19}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:50:50,925 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0805, 'learning_rate': 6.346820809248554e-05, 'epoch': 4.19}
84%|██████████████████████████████████████████████████████████████▊ | 1869/2230 [6:18:28<1:06:26, 11.04s/it]
{'loss': 0.0747, 'learning_rate': 6.329479768786126e-05, 'epoch': 4.19}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:51:09,397 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
84%|██████████████████████████████████████████████████████████████▉ | 1870/2230 [6:18:38<1:05:06, 10.85s/it]
{'loss': 0.0708, 'learning_rate': 6.3121387283237e-05, 'epoch': 4.19}
{'loss': 0.0769, 'learning_rate': 6.294797687861272e-05, 'epoch': 4.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:51:33,567 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 84%|██████████████████████████████████████████████████████████████▉ | 1872/2230 [6:18:58<1:02:24, 10.46s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 22:51:37,746 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:51:43,663 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1026, 'learning_rate': 6.260115606936416e-05, 'epoch': 4.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:51:48,029 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 22:51:51,896 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:51:54,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.07, 'learning_rate': 6.242774566473988e-05, 'epoch': 4.2}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:51:58,237 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:00,354 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:06,397 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:08,419 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:10,375 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:12,348 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:14,344 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:16,197 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:18,030 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:19,808 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:23,316 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:26,700 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:28,400 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:29,962 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:33,016 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:34,510 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:37,220 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:38,542 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:41,184 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:42,379 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:44,730 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:46,824 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:48,878 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:50,724 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:52,579 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:54,891 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1007, 'learning_rate': 6.0693641618497105e-05, 'epoch': 4.22}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:52:58,199 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:01,840 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:05,443 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:08,979 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.2018, 'learning_rate': 6.052023121387283e-05, 'epoch': 4.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:12,582 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:16,130 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:19,623 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:23,090 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1629, 'learning_rate': 6.034682080924855e-05, 'epoch': 4.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:26,665 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:30,076 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:33,511 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:36,934 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.122, 'learning_rate': 6.0173410404624274e-05, 'epoch': 4.23}
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:40,436 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:43,809 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:49,149 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 22:53:52,517 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1055, 'learning_rate': 5.9999999999999995e-05, 'epoch': 4.23}
{'loss': 0.1079, 'learning_rate': 5.982658959537572e-05, 'epoch': 4.24}
 85%|███████████████████████████████████████████████████████████████▌ | 1890/2230 [6:21:43<1:13:14, 12.93s/it]
{'loss': 0.0997, 'learning_rate': 5.965317919075144e-05, 'epoch': 4.24}
{'loss': 0.1041, 'learning_rate': 5.9479768786127164e-05, 'epoch': 4.24}
 85%|███████████████████████████████████████████████████████████████▋ | 1892/2230 [6:22:10<1:14:15, 13.18s/it]
{'loss': 0.1212, 'learning_rate': 5.930635838150289e-05, 'epoch': 4.24}
{'loss': 0.0923, 'learning_rate': 5.913294797687861e-05, 'epoch': 4.24}
85%|███████████████████████████████████████████████████████████████▋ | 1894/2230 [6:22:37<1:13:55, 13.20s/it]
{'loss': 0.0925, 'learning_rate': 5.878612716763006e-05, 'epoch': 4.25}
{'loss': 0.0758, 'learning_rate': 5.861271676300578e-05, 'epoch': 4.25}
{'loss': 0.1113, 'learning_rate': 5.84393063583815e-05, 'epoch': 4.25}
[WARNING|modeling_utils.py:388] 2022-03-22 22:55:57,163 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1055, 'learning_rate': 5.8265895953757215e-05, 'epoch': 4.26}
{'loss': 0.0753, 'learning_rate': 5.8092485549132936e-05, 'epoch': 4.26}
{'loss': 0.0915, 'learning_rate': 5.791907514450866e-05, 'epoch': 4.26}
{'loss': 0.0812, 'learning_rate': 5.7745664739884384e-05, 'epoch': 4.26}
{'loss': 0.0886, 'learning_rate': 5.7572254335260105e-05, 'epoch': 4.26}
{'loss': 0.0959, 'learning_rate': 5.739884393063583e-05, 'epoch': 4.27}
85%|████████████████████████████████████████████████████████████████ | 1904/2230 [6:24:46<1:09:13, 12.74s/it]
{'loss': 0.0911, 'learning_rate': 5.722543352601155e-05, 'epoch': 4.27}
85%|████████████████████████████████████████████████████████████████ | 1905/2230 [6:24:58<1:08:27, 12.64s/it]
{'loss': 0.0982, 'learning_rate': 5.705202312138727e-05, 'epoch': 4.27}
{'loss': 0.1004, 'learning_rate': 5.6878612716762994e-05, 'epoch': 4.27}
86%|████████████████████████████████████████████████████████████████▏ | 1907/2230 [6:25:23<1:06:53, 12.43s/it]
{'loss': 0.0696, 'learning_rate': 5.670520231213872e-05, 'epoch': 4.28}
86%|████████████████████████████████████████████████████████████████▏ | 1908/2230 [6:25:35<1:06:04, 12.31s/it]
{'loss': 0.0809, 'learning_rate': 5.653179190751444e-05, 'epoch': 4.28}
{'loss': 0.0669, 'learning_rate': 5.635838150289016e-05, 'epoch': 4.28}
{'loss': 0.0628, 'learning_rate': 5.618497109826589e-05, 'epoch': 4.28}
{'loss': 0.0613, 'learning_rate': 5.601156069364161e-05, 'epoch': 4.28}
[WARNING|modeling_utils.py:388] 2022-03-22 22:58:53,180 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
86%|████████████████████████████████████████████████████████████████▎ | 1912/2230 [6:26:22<1:03:03, 11.90s/it]
{'loss': 0.0931, 'learning_rate': 5.583815028901733e-05, 'epoch': 4.29}
86%|████████████████████████████████████████████████████████████████▎ | 1912/2230 [6:26:22<1:03:03, 11.90s/it] {'loss': 0.0753, 'learning_rate': 5.566473988439306e-05, 'epoch': 4.29}
86%|████████████████████████████████████████████████████████████████▎ | 1914/2230 [6:26:47<1:03:44, 12.10s/it] {'loss': 0.0692, 'learning_rate': 5.549132947976878e-05, 'epoch': 4.29} [WARNING|modeling_utils.py:388] 2022-03-22 22:59:28,121 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 22:59:32,319 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 22:59:32,319 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.1025, 'learning_rate': 5.53179190751445e-05, 'epoch': 4.29} 86%|████████████████████████████████████████████████████████████████▍ | 1916/2230 [6:27:09<1:01:02, 11.66s/it] {'loss': 0.0699, 'learning_rate': 5.514450867052022e-05, 'epoch': 4.3}
{'loss': 0.085, 'learning_rate': 5.497109826589595e-05, 'epoch': 4.3} [WARNING|modeling_utils.py:388] 2022-03-22 23:00:02,644 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:00:02,644 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.0596, 'learning_rate': 5.479768786127167e-05, 'epoch': 4.3} 86%|██████████████████████████████████████████████████████████████████▎ | 1919/2230 [6:27:41<57:03, 11.01s/it]
86%|██████████████████████████████████████████████████████████████████▎ | 1919/2230 [6:27:41<57:03, 11.01s/it] {'loss': 0.0638, 'learning_rate': 5.462427745664739e-05, 'epoch': 4.3} [WARNING|modeling_utils.py:388] 2022-03-22 23:00:24,967 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 86%|██████████████████████████████████████████████████████████████████▎ | 1920/2230 [6:27:52<55:46, 10.79s/it] {'loss': 0.0701, 'learning_rate': 5.445086705202312e-05, 'epoch': 4.3}
[WARNING|modeling_utils.py:388] 2022-03-22 23:00:35,150 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 86%|██████████████████████████████████████████████████████████████████▎ | 1921/2230 [6:28:02<54:29, 10.58s/it] [WARNING|modeling_utils.py:388] 2022-03-22 23:00:41,426 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 23:00:47,111 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
86%|██████████████████████████████████████████████████████████████████▎ | 1922/2230 [6:28:12<53:16, 10.38s/it] [WARNING|modeling_utils.py:388] 2022-03-22 23:00:51,239 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:00:57,177 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:00:59,554 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.053, 'learning_rate': 5.393063583815028e-05, 'epoch': 4.31} [WARNING|modeling_bart.py:1051] 2022-03-22 23:01:03,705 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:01:07,500 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:09,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:11,969 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:01:11,969 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_bart.py:1051] 2022-03-22 23:01:15,756 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:01:19,963 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:21,943 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:23,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:01:23,881 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:25,810 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:27,761 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:29,619 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:31,413 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:33,168 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:01:34,986 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:36,641 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:39,850 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:41,473 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:42,989 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:01:45,838 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:47,251 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:49,812 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:52,261 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:53,390 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:01:55,552 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:57,728 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:01:59,660 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:02,487 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:04,260 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:06,568 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.0967, 'learning_rate': 5.2023121387283234e-05, 'epoch': 4.34} [WARNING|modeling_utils.py:388] 2022-03-22 23:02:10,328 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:13,954 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:17,509 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:21,024 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.1576, 'learning_rate': 5.1849710982658955e-05, 'epoch': 4.34} [WARNING|modeling_utils.py:388] 2022-03-22 23:02:24,619 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:28,069 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:31,574 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:35,046 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed {'loss': 0.144, 'learning_rate': 5.1676300578034675e-05, 'epoch': 4.34} [WARNING|modeling_utils.py:388] 2022-03-22 23:02:38,573 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:42,002 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:388] 2022-03-22 23:02:45,425 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:48,818 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:52,212 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:02:57,521 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:03:00,892 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1134, 'learning_rate': 5.1329479768786124e-05, 'epoch': 4.35}
87%|█████████████████████████████████████████████████████████████████▏ | 1939/2230 [6:30:41<1:01:05, 12.60s/it]
{'loss': 0.1297, 'learning_rate': 5.1156069364161844e-05, 'epoch': 4.35}
{'loss': 0.1048, 'learning_rate': 5.0982658959537565e-05, 'epoch': 4.35}
{'loss': 0.0935, 'learning_rate': 5.080924855491329e-05, 'epoch': 4.35}
{'loss': 0.1055, 'learning_rate': 5.063583815028901e-05, 'epoch': 4.35}
{'loss': 0.0977, 'learning_rate': 5.0462427745664734e-05, 'epoch': 4.36}
{'loss': 0.0831, 'learning_rate': 5.028901734104046e-05, 'epoch': 4.36}
87%|█████████████████████████████████████████████████████████████████▍ | 1945/2230 [6:32:00<1:02:07, 13.08s/it]
{'loss': 0.0901, 'learning_rate': 5.011560693641618e-05, 'epoch': 4.36}
{'loss': 0.1007, 'learning_rate': 4.99421965317919e-05, 'epoch': 4.36}
{'loss': 0.0779, 'learning_rate': 4.976878612716762e-05, 'epoch': 4.37}
{'loss': 0.0516, 'learning_rate': 4.959537572254335e-05, 'epoch': 4.37}
87%|█████████████████████████████████████████████████████████████████▌ | 1949/2230 [6:32:52<1:00:09, 12.84s/it]
{'loss': 0.0659, 'learning_rate': 4.942196531791907e-05, 'epoch': 4.37}
{'loss': 0.0775, 'learning_rate': 4.924855491329479e-05, 'epoch': 4.37}
{'loss': 0.0825, 'learning_rate': 4.907514450867052e-05, 'epoch': 4.37}
{'loss': 0.0651, 'learning_rate': 4.890173410404624e-05, 'epoch': 4.38}
{'loss': 0.0806, 'learning_rate': 4.872832369942196e-05, 'epoch': 4.38}
88%|███████████████████████████████████████████████████████████████████▍ | 1954/2230 [6:33:56<58:20, 12.68s/it]
88%|███████████████████████████████████████████████████████████████████▌ | 1955/2230 [6:34:08<57:38, 12.58s/it]
88%|███████████████████████████████████████████████████████████████████▌ | 1956/2230 [6:34:21<56:56, 12.47s/it]
{'loss': 0.0702, 'learning_rate': 4.803468208092485e-05, 'epoch': 4.39}
{'loss': 0.0604, 'learning_rate': 4.786127167630058e-05, 'epoch': 4.39}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:07:33,181 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0662, 'learning_rate': 4.76878612716763e-05, 'epoch': 4.39}
{'loss': 0.0537, 'learning_rate': 4.751445086705202e-05, 'epoch': 4.39}
88%|███████████████████████████████████████████████████████████████████▋ | 1961/2230 [6:35:20<53:37, 11.96s/it]
{'loss': 0.0625, 'learning_rate': 4.734104046242774e-05, 'epoch': 4.4}
{'loss': 0.0728, 'learning_rate': 4.716763005780347e-05, 'epoch': 4.4}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:08:14,192 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0644, 'learning_rate': 4.699421965317919e-05, 'epoch': 4.4}
88%|███████████████████████████████████████████████████████████████████▊ | 1964/2230 [6:35:57<53:29, 12.06s/it]
{'loss': 0.0851, 'learning_rate': 4.682080924855491e-05, 'epoch': 4.4}
{'loss': 0.0825, 'learning_rate': 4.6647398843930636e-05, 'epoch': 4.41}
{'loss': 0.0469, 'learning_rate': 4.647398843930636e-05, 'epoch': 4.41}
[WARNING|modeling_utils.py:388] 2022-03-22 23:09:03,266 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
88%|███████████████████████████████████████████████████████████████████▉ | 1967/2230 [6:36:30<50:18, 11.48s/it]
{'loss': 0.0861, 'learning_rate': 4.630057803468208e-05, 'epoch': 4.41}
{'loss': 0.0768, 'learning_rate': 4.61271676300578e-05, 'epoch': 4.41}
{'loss': 0.084, 'learning_rate': 4.5953757225433526e-05, 'epoch': 4.41}
[WARNING|modeling_utils.py:388] 2022-03-22 23:09:33,888 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.072, 'learning_rate': 4.5780346820809246e-05, 'epoch': 4.42}
88%|████████████████████████████████████████████████████████████████████ | 1971/2230 [6:37:12<46:05, 10.68s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 23:09:51,891 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:09:57,928 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0594, 'learning_rate': 4.5433526011560694e-05, 'epoch': 4.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:10:02,386 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:10:04,745 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:08,701 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0618, 'learning_rate': 4.5260115606936415e-05, 'epoch': 4.42}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:10:12,986 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
89%|████████████████████████████████████████████████████████████████████▏ | 1974/2230 [6:37:41<42:08, 9.88s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:22,402 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:24,506 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:30,543 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:32,570 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:34,567 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:36,536 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:38,566 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:40,428 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:42,240 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:44,068 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:45,940 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:47,680 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:50,968 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:52,666 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:54,240 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:57,242 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:10:58,754 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:00,125 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:02,729 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:05,260 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:07,535 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:08,714 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:10,804 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:12,847 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:15,607 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:17,384 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:18,895 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1286, 'learning_rate': 4.335260115606936e-05, 'epoch': 4.45}
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:22,616 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:26,184 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:29,697 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:33,230 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1647, 'learning_rate': 4.3179190751445084e-05, 'epoch': 4.45}
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:36,757 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:40,224 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:43,701 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:47,109 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1068, 'learning_rate': 4.300578034682081e-05, 'epoch': 4.45}
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:50,613 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:54,075 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:11:57,522 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:12:00,946 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:12:04,383 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:12:09,591 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1077, 'learning_rate': 4.265895953757225e-05, 'epoch': 4.46}
{'loss': 0.098, 'learning_rate': 4.248554913294798e-05, 'epoch': 4.46}
{'loss': 0.1061, 'learning_rate': 4.23121387283237e-05, 'epoch': 4.46}
{'loss': 0.1334, 'learning_rate': 4.213872832369942e-05, 'epoch': 4.46}
89%|████████████████████████████████████████████████████████████████████▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]
{'loss': 0.0944, 'learning_rate': 4.196531791907514e-05, 'epoch': 4.47}
{'loss': 0.1024, 'learning_rate': 4.179190751445087e-05, 'epoch': 4.47}
{'loss': 0.0711, 'learning_rate': 4.161849710982658e-05, 'epoch': 4.47}
{'loss': 0.0685, 'learning_rate': 4.1445086705202304e-05, 'epoch': 4.47}
{'loss': 0.0949, 'learning_rate': 4.1271676300578025e-05, 'epoch': 4.48}
{'loss': 0.0936, 'learning_rate': 4.109826589595375e-05, 'epoch': 4.48}
{'loss': 0.0813, 'learning_rate': 4.092485549132947e-05, 'epoch': 4.48}
{'loss': 0.0909, 'learning_rate': 4.075144508670519e-05, 'epoch': 4.48}
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642██▊ | 1992/2230 [6:40:33<51:48, 13.06s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_bart.py:1051] 2022-03-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
1992/2230 [6:40:33<51:48, 13.06s/it]
[INFO|trainer.py:2366] 2022-03-22 23:14:54,855 >> Num examples = 2642
03/22/2022 23:24:38 - INFO - datasets.metric - Removing /home/sanchit_huggingface_co/.cache/huggingface/metrics/wer/default/default_experiment-1-0.arrow
{'eval_loss': 0.309664785861969, 'eval_wer': 0.09321697738992463, 'eval_runtime': 583.4014, 'eval_samples_per_second': 4.529, 'eval_steps_per_second': 0.567, 'epoch': 4.48}
{'loss': 0.0722, 'learning_rate': 4.040462427745664e-05, 'epoch': 4.49}
{'loss': 0.0647, 'learning_rate': 4.023121387283236e-05, 'epoch': 4.49}
{'loss': 0.096, 'learning_rate': 4.005780346820808e-05, 'epoch': 4.49}
{'loss': 0.0915, 'learning_rate': 3.988439306358381e-05, 'epoch': 4.49}
{'loss': 0.0799, 'learning_rate': 3.971098265895953e-05, 'epoch': 4.5}
90%|███████████████████████████████████████████████████████████████████▍ | 2006/2230 [6:55:01<2:57:32, 47.56s/it]
{'loss': 0.0812, 'learning_rate': 3.953757225433525e-05, 'epoch': 4.5}
{'loss': 0.0717, 'learning_rate': 3.936416184971098e-05, 'epoch': 4.5}
{'loss': 0.0719, 'learning_rate': 3.91907514450867e-05, 'epoch': 4.5}
{'loss': 0.0865, 'learning_rate': 3.901734104046242e-05, 'epoch': 4.5}
90%|███████████████████████████████████████████████████████████████████▌ | 2010/2230 [6:55:52<1:17:21, 21.10s/it]
{'loss': 0.0561, 'learning_rate': 3.884393063583814e-05, 'epoch': 4.51}
{'loss': 0.067, 'learning_rate': 3.867052023121387e-05, 'epoch': 4.51}
90%|███████████████████████████████████████████████████████████████████▋ | 2012/2230 [6:56:17<1:00:09, 16.56s/it]
{'loss': 0.0446, 'learning_rate': 3.849710982658959e-05, 'epoch': 4.51}
{'loss': 0.053, 'learning_rate': 3.832369942196531e-05, 'epoch': 4.51}
{'loss': 0.0593, 'learning_rate': 3.815028901734104e-05, 'epoch': 4.52}
90%|█████████████████████████████████████████████████████████████████████▌ | 2015/2230 [6:56:54<48:40, 13.59s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 23:29:41,939 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0629, 'learning_rate': 3.780346820809248e-05, 'epoch': 4.52}
90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]
90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0547, 'learning_rate': 3.745664739884393e-05, 'epoch': 4.52} 90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 90%|█████████████████████████████████████████████████████████████████████▋ | 2017/2230 [6:57:16<44:03, 12.41s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:12,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 23:30:12,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:12,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0735, 'learning_rate': 3.728323699421965e-05, 'epoch': 4.53} [WARNING|modeling_utils.py:388] 2022-03-22 23:30:12,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:12,298 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:22,888 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:22,888 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:30:22,888 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0501, 'learning_rate': 3.710982658959537e-05, 'epoch': 4.53}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:30:29,023 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:30:32,876 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0683, 'learning_rate': 3.6936416184971096e-05, 'epoch': 4.53}
[WARNING|modeling_utils.py:388] 2022-03-22 23:30:44,748 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0832, 'learning_rate': 3.6763005780346816e-05, 'epoch': 4.53}
[WARNING|modeling_utils.py:388] 2022-03-22 23:30:50,974 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 23:30:55,326 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0498, 'learning_rate': 3.658959537572254e-05, 'epoch': 4.54}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:31:01,251 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:31:03,546 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
91%|█████████████████████████████████████████████████████████████████████▉ | 2024/2230 [6:58:28<34:44, 10.12s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:07,404 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:09,589 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:11,795 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:17,946 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:20,008 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:21,986 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:23,946 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:25,966 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:27,869 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:29,759 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:31,599 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:33,525 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:35,266 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:36,987 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:38,623 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:41,904 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:43,447 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:44,948 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:47,913 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:49,315 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:52,007 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:53,221 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:55,578 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:57,835 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:31:59,819 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:01,760 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:03,620 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:03,620 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:06,239 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:06,971 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:06,971 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1192, 'learning_rate': 3.4682080924855485e-05, 'epoch': 4.56} [WARNING|modeling_utils.py:388] 2022-03-22 23:32:10,794 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:14,462 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:14,462 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:18,124 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:18,124 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:21,664 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:21,664 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1441, 'learning_rate': 3.450867052023121e-05, 'epoch': 4.56} [WARNING|modeling_utils.py:388] 2022-03-22 23:32:25,238 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:28,825 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:28,825 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:32,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:32,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:35,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:32:35,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.1257, 'learning_rate': 3.433526011560693e-05, 'epoch': 4.57} [WARNING|modeling_utils.py:388] 2022-03-22 23:32:39,363 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:42,830 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:46,290 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:49,680 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:32:53,173 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
 91%|██████████████████████████████████████████████████████████████████████▎ | 2038/2230 [7:00:29<39:34, 12.37s/it]
{'loss': 0.1209, 'learning_rate': 3.38150289017341e-05, 'epoch': 4.57}
{'loss': 0.0767, 'learning_rate': 3.364161849710982e-05, 'epoch': 4.57}
{'loss': 0.0638, 'learning_rate': 3.346820809248554e-05, 'epoch': 4.58}
{'loss': 0.0763, 'learning_rate': 3.329479768786127e-05, 'epoch': 4.58}
{'loss': 0.1027, 'learning_rate': 3.312138728323699e-05, 'epoch': 4.58}
{'loss': 0.0879, 'learning_rate': 3.294797687861271e-05, 'epoch': 4.58}
{'loss': 0.072, 'learning_rate': 3.277456647398844e-05, 'epoch': 4.59}
{'loss': 0.0753, 'learning_rate': 3.260115606936416e-05, 'epoch': 4.59}
{'loss': 0.0689, 'learning_rate': 3.242774566473988e-05, 'epoch': 4.59}
{'loss': 0.071, 'learning_rate': 3.22543352601156e-05, 'epoch': 4.59}
{'loss': 0.0758, 'learning_rate': 3.208092485549133e-05, 'epoch': 4.59}
{'loss': 0.0574, 'learning_rate': 3.190751445086705e-05, 'epoch': 4.6}
{'loss': 0.0876, 'learning_rate': 3.173410404624277e-05, 'epoch': 4.6}
 92%|██████████████████████████████████████████████████████████████████████▊ | 2052/2230 [7:03:35<38:48, 13.08s/it]
 92%|██████████████████████████████████████████████████████████████████████▉ | 2053/2230 [7:03:47<38:01, 12.89s/it]
 92%|██████████████████████████████████████████████████████████████████████▉ | 2054/2230 [7:04:00<37:22, 12.74s/it]
 92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]
92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.067, 'learning_rate': 3.1040462427745667e-05, 'epoch': 4.61} 92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 92%|██████████████████████████████████████████████████████████████████████▉ | 2055/2230 [7:04:12<36:48, 12.62s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 92%|██████████████████████████████████████████████████████████████████████▉ | 2056/2230 [7:04:24<36:11, 12.48s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 92%|██████████████████████████████████████████████████████████████████████▉ | 2056/2230 [7:04:24<36:11, 12.48s/it]g-point operations will not be computed-22 23:10:18,697 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0614, 'learning_rate': 3.086705202312139e-05, 'epoch': 4.61}
{'loss': 0.0644, 'learning_rate': 3.069364161849711e-05, 'epoch': 4.61}
{'loss': 0.0702, 'learning_rate': 3.052023121387283e-05, 'epoch': 4.61}
{'loss': 0.0629, 'learning_rate': 3.0346820809248553e-05, 'epoch': 4.62}
{'loss': 0.0638, 'learning_rate': 3.0173410404624277e-05, 'epoch': 4.62}
{'loss': 0.0759, 'learning_rate': 2.9999999999999997e-05, 'epoch': 4.62}
92%|███████████████████████████████████████████████████████████████████████▏ | 2062/2230 [7:05:36<33:30, 11.97s/it]
{'loss': 0.0786, 'learning_rate': 2.982658959537572e-05, 'epoch': 4.62}
{'loss': 0.0659, 'learning_rate': 2.9653179190751446e-05, 'epoch': 4.63}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:38:40,630 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 23:38:46,764 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0729, 'learning_rate': 2.930635838150289e-05, 'epoch': 4.63}
93%|███████████████████████████████████████████████████████████████████████▎ | 2066/2230 [7:06:24<31:56, 11.69s/it]
{'loss': 0.0761, 'learning_rate': 2.9132947976878608e-05, 'epoch': 4.63}
{'loss': 0.0656, 'learning_rate': 2.895953757225433e-05, 'epoch': 4.63}
[WARNING|modeling_utils.py:388] 2022-03-22 23:39:21,098 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0555, 'learning_rate': 2.8786127167630052e-05, 'epoch': 4.64}
[WARNING|modeling_utils.py:388] 2022-03-22 23:39:31,649 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0612, 'learning_rate': 2.8612716763005776e-05, 'epoch': 4.64}
[WARNING|modeling_utils.py:388] 2022-03-22 23:39:43,375 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0519, 'learning_rate': 2.8439306358381497e-05, 'epoch': 4.64}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:39:48,060 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 23:39:53,518 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0574, 'learning_rate': 2.826589595375722e-05, 'epoch': 4.64}
[WARNING|modeling_utils.py:388] 2022-03-22 23:39:59,813 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
93%|███████████████████████████████████████████████████████████████████████▌ | 2072/2230 [7:07:26<27:27, 10.43s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:05,970 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:10,355 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:14,396 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0728, 'learning_rate': 2.7919075144508666e-05, 'epoch': 4.65}
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:20,265 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:22,576 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0571, 'learning_rate': 2.774566473988439e-05, 'epoch': 4.65}
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:26,027 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:40:28,220 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0653, 'learning_rate': 2.757225433526011e-05, 'epoch': 4.65}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:38,215 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:40,290 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▋ | 2076/2230 [7:08:05<24:47, 9.66s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:40:42,359 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:44,323 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:46,223 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:48,104 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▋ | 2077/2230 [7:08:12<23:09, 9.08s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:40:50,071 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:51,899 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:53,689 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:40:55,476 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▊ | 2078/2230 [7:08:20<21:37, 8.54s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:40:57,252 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:00,548 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:02,166 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▊ | 2079/2230 [7:08:26<20:01, 7.95s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:03,824 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:05,339 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:08,309 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▊ | 2080/2230 [7:08:32<18:26, 7.38s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:09,781 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:12,431 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▊ | 2081/2230 [7:08:38<16:42, 6.73s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:14,912 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:16,067 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
93%|███████████████████████████████████████████████████████████████████████▉ | 2082/2230 [7:08:42<14:54, 6.05s/it]
93%|███████████████████████████████████████████████████████████████████████▉ | 2082/2230 [7:08:42<14:54, 6.05s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:14,912 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:20,310 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:19,326 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2083/2230 [7:08:46<13:15, 5.41s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:19,326 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2083/2230 [7:08:46<13:15, 5.41s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:19,326 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:24,899 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:23,201 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2084/2230 [7:08:49<11:40, 4.80s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:23,201 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2084/2230 [7:08:49<11:40, 4.80s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:23,201 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
93%|███████████████████████████████████████████████████████████████████████▉ | 2084/2230 [7:08:49<11:40, 4.80s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:31,161 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:31,161 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:34,695 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:34,695 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:38,237 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2085/2230 [7:09:04<18:33, 7.68s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
93%|███████████████████████████████████████████████████████████████████████▉ | 2085/2230 [7:09:04<18:33, 7.68s/it] Setting `use_cache=False`...1] 2022-03-22 23:41:27,522 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2085/2230 [7:09:04<18:33, 7.68s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 93%|███████████████████████████████████████████████████████████████████████▉ | 2085/2230 [7:09:04<18:33, 7.68s/it][WARNING|modeling_bart.py:1051] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:45,337 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:48,778 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:48,778 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_bart.py:1051] 2022-03-22 23:41:52,281 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...1] 2022-03-22 23:41:41,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
94%|████████████████████████████████████████████████████████████████████████ | 2086/2230 [7:09:18<22:58, 9.57s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:55,833 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:41:59,346 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:42:02,812 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:42:06,342 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
94%|████████████████████████████████████████████████████████████████████████ | 2087/2230 [7:09:32<26:01, 10.92s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:42:13,275 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0872, 'learning_rate': 2.5317919075144507e-05, 'epoch': 4.68}
94%|████████████████████████████████████████████████████████████████████████▏ | 2089/2230 [7:10:01<29:52, 12.71s/it]
{'loss': 0.1099, 'learning_rate': 2.514450867052023e-05, 'epoch': 4.68}
{'loss': 0.094, 'learning_rate': 2.497109826589595e-05, 'epoch': 4.69}
{'loss': 0.1113, 'learning_rate': 2.4797687861271675e-05, 'epoch': 4.69}
{'loss': 0.0828, 'learning_rate': 2.4624277456647396e-05, 'epoch': 4.69}
{'loss': 0.0799, 'learning_rate': 2.445086705202312e-05, 'epoch': 4.69}
{'loss': 0.0837, 'learning_rate': 2.427745664739884e-05, 'epoch': 4.7}
{'loss': 0.0823, 'learning_rate': 2.4104046242774565e-05, 'epoch': 4.7}
{'loss': 0.0878, 'learning_rate': 2.393063583815029e-05, 'epoch': 4.7}
94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it]
{'loss': 0.0632, 'learning_rate': 2.375722543352601e-05, 'epoch': 4.7}
{'loss': 0.0509, 'learning_rate': 2.3583815028901734e-05, 'epoch': 4.7}
{'loss': 0.0624, 'learning_rate': 2.3410404624277454e-05, 'epoch': 4.71}
{'loss': 0.0719, 'learning_rate': 2.323699421965318e-05, 'epoch': 4.71}
94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0763, 'learning_rate': 2.30635838150289e-05, 'epoch': 4.71} 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
94%|████████████████████████████████████████████████████████████████████████▍ | 2097/2230 [7:11:47<29:04, 13.12s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0815, 'learning_rate': 2.2890173410404623e-05, 'epoch': 4.71} 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2102/2230 [7:12:53<27:52, 13.07s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0548, 'learning_rate': 2.2716763005780347e-05, 'epoch': 4.72} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0742, 'learning_rate': 2.2543352601156068e-05, 'epoch': 4.72} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0509, 'learning_rate': 2.2369942196531792e-05, 'epoch': 4.72} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0468, 'learning_rate': 2.2196531791907513e-05, 'epoch': 4.72} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0499, 'learning_rate': 2.2023121387283237e-05, 'epoch': 4.72} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0682, 'learning_rate': 2.184971098265896e-05, 'epoch': 4.73} 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 94%|████████████████████████████████████████████████████████████████████████▌ | 2103/2230 [7:13:05<27:16, 12.88s/it] Setting `use_cache=False`...1] 2022-03-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
95%|████████████████████████████████████████████████████████████████████████▊ | 2109/2230 [7:14:19<24:45, 12.28s/it]
{'loss': 0.054, 'learning_rate': 2.167630057803468e-05, 'epoch': 4.73}
{'loss': 0.0559, 'learning_rate': 2.1502890173410405e-05, 'epoch': 4.73}
{'loss': 0.0495, 'learning_rate': 2.1329479768786126e-05, 'epoch': 4.73}
{'loss': 0.0674, 'learning_rate': 2.115606936416185e-05, 'epoch': 4.74}
95%|████████████████████████████████████████████████████████████████████████▉ | 2113/2230 [7:15:08<24:18, 12.46s/it]
{'loss': 0.0673, 'learning_rate': 2.098265895953757e-05, 'epoch': 4.74}
{'loss': 0.0707, 'learning_rate': 2.080924855491329e-05, 'epoch': 4.74}
{'loss': 0.0635, 'learning_rate': 2.0635838150289012e-05, 'epoch': 4.74}
[WARNING|modeling_utils.py:388] 2022-03-22 23:48:22,028 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:48:26,143 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:48:30,260 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0424, 'learning_rate': 2.028901734104046e-05, 'epoch': 4.75} [WARNING|modeling_utils.py:388] 2022-03-22 23:48:34,365 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:48:34,365 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:48:38,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:48:38,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:48:38,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:48:38,385 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:42:09,884 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0546, 'learning_rate': 2.011560693641618e-05, 'epoch': 4.75}
{'loss': 0.0524, 'learning_rate': 1.9942196531791905e-05, 'epoch': 4.75}
[WARNING|modeling_utils.py:388] 2022-03-22 23:48:58,250 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:49:00,813 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0984, 'learning_rate': 1.9768786127167626e-05, 'epoch': 4.75}
[WARNING|modeling_utils.py:388] 2022-03-22 23:49:04,710 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
95%|█████████████████████████████████████████████████████████████████████████▏ | 2121/2230 [7:16:35<19:25, 10.69s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:13,151 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:19,294 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0632, 'learning_rate': 1.942196531791907e-05, 'epoch': 4.76}
[WARNING|modeling_utils.py:388] 2022-03-22 23:49:27,131 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:31,430 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0498, 'learning_rate': 1.9248554913294795e-05, 'epoch': 4.76}
[WARNING|modeling_utils.py:388] 2022-03-22 23:49:35,455 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:39,593 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
95%|█████████████████████████████████████████████████████████████████████████▎ | 2124/2230 [7:17:04<17:33, 9.94s/it]
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0575, 'learning_rate': 1.907514450867052e-05, 'epoch': 4.76}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:47,452 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:49,561 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.049, 'learning_rate': 1.890173410404624e-05, 'epoch': 4.76}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:55,689 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:57,701 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:49:59,679 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:01,734 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:03,595 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:05,441 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:07,275 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:09,129 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:10,897 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:12,603 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:15,985 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:17,611 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:19,149 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:22,189 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:23,576 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:26,256 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:27,608 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:30,070 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:32,436 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:34,568 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:35,601 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:38,499 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:40,361 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:41,995 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:42,721 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:46,072 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:49,667 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:53,240 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:50:56,763 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1591, 'learning_rate': 1.7167630057803466e-05, 'epoch': 4.79}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:00,364 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:03,918 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:07,358 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:10,807 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1144, 'learning_rate': 1.699421965317919e-05, 'epoch': 4.79}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:14,341 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:17,784 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:21,234 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:24,679 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.1283, 'learning_rate': 1.682080924855491e-05, 'epoch': 4.79}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:28,149 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:31,505 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_bart.py:1051] 2022-03-22 23:51:36,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.102, 'learning_rate': 1.6647398843930635e-05, 'epoch': 4.79}
{'loss': 0.0936, 'learning_rate': 1.6473988439306356e-05, 'epoch': 4.8}
{'loss': 0.0992, 'learning_rate': 1.630057803468208e-05, 'epoch': 4.8}
{'loss': 0.0878, 'learning_rate': 1.61271676300578e-05, 'epoch': 4.8}
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it]
{'loss': 0.0724, 'learning_rate': 1.5953757225433525e-05, 'epoch': 4.8} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0741, 'learning_rate': 1.578034682080925e-05, 'epoch': 4.8} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0743, 'learning_rate': 1.560693641618497e-05, 'epoch': 4.81} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0643, 'learning_rate': 1.5433526011560694e-05, 'epoch': 4.81} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.087, 'learning_rate': 1.5260115606936414e-05, 'epoch': 4.81} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0482, 'learning_rate': 1.5086705202312138e-05, 'epoch': 4.81} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0634, 'learning_rate': 1.491329479768786e-05, 'epoch': 4.82} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0908, 'learning_rate': 1.4739884393063583e-05, 'epoch': 4.82} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0669, 'learning_rate': 1.4566473988439304e-05, 'epoch': 4.82} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0904, 'learning_rate': 1.4393063583815026e-05, 'epoch': 4.82} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0609, 'learning_rate': 1.4219653179190749e-05, 'epoch': 4.83} 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 96%|█████████████████████████████████████████████████████████████████████████▉ | 2142/2230 [7:19:58<19:18, 13.17s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
{'loss': 0.0588, 'learning_rate': 1.4046242774566473e-05, 'epoch': 4.83} 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▎ | 2153/2230 [7:22:21<16:20, 12.74s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.082, 'learning_rate': 1.3872832369942195e-05, 'epoch': 4.83} 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2154/2230 [7:22:33<15:58, 12.62s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0537, 'learning_rate': 1.3699421965317917e-05, 'epoch': 4.83} 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.0742, 'learning_rate': 1.352601156069364e-05, 'epoch': 4.83} 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2155/2230 [7:22:45<15:36, 12.48s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it] Setting `use_cache=False`...1] 2022-03-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
97%|██████████████████████████████████████████████████████████████████████████▍ | 2157/2230 [7:23:10<14:58, 12.31s/it]
{'loss': 0.0743, 'learning_rate': 1.3179190751445084e-05, 'epoch': 4.84}
{'loss': 0.0412, 'learning_rate': 1.3005780346820809e-05, 'epoch': 4.84}
[WARNING|modeling_utils.py:388] 2022-03-22 23:56:22,303 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0736, 'learning_rate': 1.2832369942196531e-05, 'epoch': 4.84}
97%|██████████████████████████████████████████████████████████████████████████▌ | 2161/2230 [7:23:57<13:44, 11.94s/it]
{'loss': 0.0472, 'learning_rate': 1.2658959537572253e-05, 'epoch': 4.85}
97%|██████████████████████████████████████████████████████████████████████████▌ | 2161/2230 [7:23:57<13:44, 11.94s/it]
{'loss': 0.0592, 'learning_rate': 1.2485549132947976e-05, 'epoch': 4.85}
{'loss': 0.066, 'learning_rate': 1.2312138728323698e-05, 'epoch': 4.85}
97%|██████████████████████████████████████████████████████████████████████████▋ | 2164/2230 [7:24:34<13:19, 12.11s/it]
{'loss': 0.061, 'learning_rate': 1.213872832369942e-05, 'epoch': 4.85}
97%|██████████████████████████████████████████████████████████████████████████▋ | 2164/2230 [7:24:34<13:19, 12.11s/it]
{'loss': 0.0519, 'learning_rate': 1.1965317919075144e-05, 'epoch': 4.85}
[WARNING|modeling_utils.py:388] 2022-03-22 23:57:27,707 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
97%|██████████████████████████████████████████████████████████████████████████▊ | 2166/2230 [7:24:56<12:27, 11.69s/it]
{'loss': 0.0337, 'learning_rate': 1.1791907514450867e-05, 'epoch': 4.86}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:57:39,695 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.0496, 'learning_rate': 1.161849710982659e-05, 'epoch': 4.86}
{'loss': 0.0457, 'learning_rate': 1.1445086705202312e-05, 'epoch': 4.86}
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:03,956 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:07,926 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:14,357 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.085, 'learning_rate': 1.1098265895953756e-05, 'epoch': 4.87}
[WARNING|modeling_bart.py:1051] 2022-03-22 23:58:20,359 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
97%|██████████████████████████████████████████████████████████████████████████▉ | 2171/2230 [7:25:49<10:25, 10.61s/it]
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:28,350 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:34,398 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0437, 'learning_rate': 1.0751445086705203e-05, 'epoch': 4.87}
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:40,420 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:42,777 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0581, 'learning_rate': 1.0578034682080925e-05, 'epoch': 4.87}
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:48,595 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:50,805 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:53,039 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0467, 'learning_rate': 1.0404624277456646e-05, 'epoch': 4.87}
[WARNING|modeling_utils.py:388] 2022-03-22 23:58:58,575 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:00,677 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0702, 'learning_rate': 1.0231213872832368e-05, 'epoch': 4.88}
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:06,562 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:08,555 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:10,508 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:12,419 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:14,389 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:16,244 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:18,061 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:19,826 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:21,602 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:24,926 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:26,579 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:28,234 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:31,302 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:32,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:34,293 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:37,001 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:38,297 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:39,651 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:43,372 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:44,657 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:46,871 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-22 23:59:48,981 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:59:50,826 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:59:50,826 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:59:52,713 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:59:55,126 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-22 23:59:55,126 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... {'loss': 0.061, 'learning_rate': 8.670520231213871e-06, 'epoch': 4.9} [WARNING|modeling_utils.py:388] 2022-03-22 23:59:59,061 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:02,573 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:06,154 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:09,696 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.111, 'learning_rate': 8.497109826589595e-06, 'epoch': 4.9}
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:13,290 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:16,823 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:20,318 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:23,769 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0942, 'learning_rate': 8.323699421965318e-06, 'epoch': 4.9}
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:27,358 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:30,802 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:34,117 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:37,481 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:40,936 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:00:46,249 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0627, 'learning_rate': 7.976878612716762e-06, 'epoch': 4.91}
98%|███████████████████████████████████████████████████████████████████████████▌ | 2189/2230 [7:28:30<08:38, 12.64s/it]
{'loss': 0.0891, 'learning_rate': 7.803468208092485e-06, 'epoch': 4.91}
98%|███████████████████████████████████████████████████████████████████████████▌ | 2189/2230 [7:28:30<08:38, 12.64s/it]
{'loss': 0.0673, 'learning_rate': 7.630057803468207e-06, 'epoch': 4.91}
{'loss': 0.087, 'learning_rate': 7.45664739884393e-06, 'epoch': 4.91}
{'loss': 0.0792, 'learning_rate': 7.283236994219652e-06, 'epoch': 4.91}
{'loss': 0.0868, 'learning_rate': 7.109826589595374e-06, 'epoch': 4.92}
{'loss': 0.0859, 'learning_rate': 6.9364161849710975e-06, 'epoch': 4.92}
{'loss': 0.0631, 'learning_rate': 6.76300578034682e-06, 'epoch': 4.92}
98%|███████████████████████████████████████████████████████████████████████████▊ | 2196/2230 [7:30:02<07:21, 12.97s/it]
{'loss': 0.0661, 'learning_rate': 6.589595375722542e-06, 'epoch': 4.92}
99%|███████████████████████████████████████████████████████████████████████████▊ | 2197/2230 [7:30:14<07:06, 12.91s/it]
{'loss': 0.0472, 'learning_rate': 6.4161849710982654e-06, 'epoch': 4.93}
{'loss': 0.0575, 'learning_rate': 6.242774566473988e-06, 'epoch': 4.93}
{'loss': 0.0572, 'learning_rate': 6.06936416184971e-06, 'epoch': 4.93}
{'loss': 0.0756, 'learning_rate': 5.895953757225433e-06, 'epoch': 4.93}
99%|███████████████████████████████████████████████████████████████████████████▉ | 2201/2230 [7:31:07<06:21, 13.15s/it]
99%|████████████████████████████████████████████████████████████████████████████ | 2202/2230 [7:31:19<06:01, 12.92s/it]
{'loss': 0.0564, 'learning_rate': 5.549132947976878e-06, 'epoch': 4.94}
99%|████████████████████████████████████████████████████████████████████████████ | 2203/2230 [7:31:32<05:44, 12.77s/it]
{'loss': 0.0701, 'learning_rate': 5.375722543352601e-06, 'epoch': 4.94}
{'loss': 0.0484, 'learning_rate': 5.202312138728323e-06, 'epoch': 4.94}
99%|████████████████████████████████████████████████████████████████████████████▏| 2205/2230 [7:31:56<05:11, 12.48s/it]
{'loss': 0.068, 'learning_rate': 5.028901734104045e-06, 'epoch': 4.94}
99%|████████████████████████████████████████████████████████████████████████████▏| 2206/2230 [7:32:08<04:56, 12.35s/it]
{'loss': 0.0546, 'learning_rate': 4.855491329479768e-06, 'epoch': 4.95}
{'loss': 0.0635, 'learning_rate': 4.682080924855491e-06, 'epoch': 4.95}
{'loss': 0.0693, 'learning_rate': 4.508670520231213e-06, 'epoch': 4.95}
{'loss': 0.0459, 'learning_rate': 4.335260115606936e-06, 'epoch': 4.95}
[WARNING|modeling_bart.py:1051] 2022-03-23 00:05:24,981 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
99%|████████████████████████████████████████████████████████████████████████████▎| 2210/2230 [7:32:56<03:58, 11.94s/it]
{'loss': 0.0561, 'learning_rate': 4.161849710982659e-06, 'epoch': 4.96}
{'loss': 0.0614, 'learning_rate': 3.988439306358381e-06, 'epoch': 4.96}
[WARNING|modeling_utils.py:388] 2022-03-23 00:05:55,778 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0694, 'learning_rate': 3.8150289017341036e-06, 'epoch': 4.96}
[WARNING|modeling_utils.py:388] 2022-03-23 00:06:06,090 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0504, 'learning_rate': 3.641618497109826e-06, 'epoch': 4.96}
[WARNING|modeling_bart.py:1051] 2022-03-23 00:06:22,554 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 0.064, 'learning_rate': 3.294797687861271e-06, 'epoch': 4.97}
[WARNING|modeling_utils.py:388] 2022-03-23 00:06:40,483 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0734, 'learning_rate': 3.121387283236994e-06, 'epoch': 4.97}
[WARNING|modeling_utils.py:388] 2022-03-23 00:06:51,035 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0597, 'learning_rate': 2.9479768786127167e-06, 'epoch': 4.97}
[WARNING|modeling_utils.py:388] 2022-03-23 00:06:54,942 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-23 00:07:00,867 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
99%|████████████████████████████████████████████████████████████████████████████▌| 2218/2230 [7:34:26<02:10, 10.84s/it]
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:05,209 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:11,357 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0282, 'learning_rate': 2.6011560693641614e-06, 'epoch': 4.98}
[WARNING|modeling_bart.py:1051] 2022-03-23 00:07:15,810 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:19,822 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.0546, 'learning_rate': 2.427745664739884e-06, 'epoch': 4.98}
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:25,712 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:27,976 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:30,187 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_bart.py:1051] 2022-03-23 00:07:34,214 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:37,724 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:39,736 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:41,815 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:43,780 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:45,690 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:47,531 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:49,439 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:51,240 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:53,005 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:56,464 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:58,080 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:07:59,666 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:04,364 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:05,789 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:08,497 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:09,896 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:12,332 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:13,490 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:13,490 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:15,747 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:17,802 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:17,802 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:19,824 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:21,647 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [WARNING|modeling_utils.py:388] 2022-03-23 00:08:21,647 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[WARNING|modeling_utils.py:388] 2022-03-23 00:08:24,137 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.1199, 'learning_rate': 6.936416184971098e-07, 'epoch': 5.0}
[INFO|configuration_utils.py:438] 2022-03-23 00:08:24,862 >> Configuration saved in ./config.json
[INFO|configuration_utils.py:438] 2022-03-23 00:08:36,626 >> Configuration saved in ./config.json
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 4%|▍ | 8.16M/216M [00:01<00:25, 8.49MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 19%|██▎ | 41.2M/216M [00:03<00:12, 15.2MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 35%|████▏ | 75.2M/216M [00:05<00:08, 16.9MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 51%|██████▋ | 110M/216M [00:07<00:06, 17.5MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 66%|████████▋ | 144M/216M [00:09<00:04, 17.4MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 83%|██████████▋ | 178M/216M [00:11<00:02, 17.8MB/s]
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 99%|████████████▊| 214M/216M [00:13<00:00, 18.2MB/s]
remote: tput: No value for $TERM and no -T specified
03/23/2022 00:11:56 - WARNING - huggingface_hub.repository - remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
To https://huggingface.co/sanchit-gandhi/wav2vec2-2-bart-large-cnn
Upload file runs/Mar22_16-32-07_sanchit--v100/events.out.tfevents.1647966754.sanchit--v100.270815.0: 100%|█| 352k/352k [
{'dataset': {'name': 'librispeech_asr', 'type': 'librispeech_asr', 'args': 'clean'}}
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]
03/23/2022 00:12:15 - WARNING - huggingface_hub.repository - remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
To https://huggingface.co/sanchit-gandhi/wav2vec2-2-bart-large-cnn
***** train metrics *****
  epoch                    =        5.0
  train_loss               =      1.483
  train_runtime            = 7:35:49.98
  train_samples            =      28538
  train_samples_per_second =      5.217
  train_steps_per_second   =      0.082
03/23/2022 00:12:18 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
[INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... [INFO|trainer.py:2369] 2022-03-23 00:12:18,345 >> Batch size = 8 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]g-point operations will not be computed-22 23:49:41,933 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
03/23/2022 00:21:53 - INFO - datasets.metric - Removing /home/sanchit_huggingface_co/.cache/huggingface/metrics/wer/default/default_experiment-1-0.arrow
***** eval metrics *****
  epoch                   =        5.0
  eval_loss               =     0.3051
  eval_runtime            = 0:09:35.24
  eval_samples            =       2642
  eval_samples_per_second =      4.593
  eval_steps_per_second   =      0.575
  eval_wer                =     0.0899
03/23/2022 00:22:39 - WARNING - huggingface_hub.repository - remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
remote: tput: No value for $TERM and no -T specified
To https://huggingface.co/sanchit-gandhi/wav2vec2-2-bart-large-cnn
Upload file wandb/run-20220322_163235-2yj5gh94/run-2yj5gh94.wandb: 100%|█████████████| 216M/216M [00:11<00:00, 20.5MB/s]
  File "/home/sanchit_huggingface_co/gcp/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 870, in model_info