08/12/2023 23:18:17 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/12/2023 23:18:17 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='s-nlp/bart-base-detox', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/8_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='s-nlp/bart-base-detox', student_model='s-nlp/bart-base-detox', pred_distill=True, intermediate_distill=True, weight_bits=8, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/12/2023 23:18:33 - INFO - __main__ - ***** Running training *****
08/12/2023 23:18:33 - INFO - __main__ - Num examples = 19546
08/12/2023 23:18:33 - INFO - __main__ - Num Epochs = 10
08/12/2023 23:18:33 - INFO - __main__ - Instantaneous batch size per device = 8
08/12/2023 23:18:33 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/12/2023 23:18:33 - INFO - __main__ - Gradient Accumulation steps = 2
08/12/2023 23:18:33 - INFO - __main__ - Total optimization steps = 24440
08/12/2023 23:18:33 - INFO - __main__ - student encoder layers = 3
08/12/2023 23:18:33 - INFO - __main__ - student decoder layers = 1
08/12/2023 23:18:33 - INFO - __main__ - student encoder layers [0, 1, 2] is mapped with teacher encoder layers [0, 2, 5]
08/12/2023 23:18:33 - INFO - __main__ - student decoder layers [0] is mapped with teacher decoder layers [5]
08/12/2023 23:26:48 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 21.572847829671986}
08/12/2023 23:34:51 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 54.29186037002958}
08/12/2023 23:43:04 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 61.69900398207265}
08/12/2023 23:50:59 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 63.62850416895223}
08/12/2023 23:59:06 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.56508552124465}
08/13/2023 00:07:12 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.7503365706928}
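The layer maps logged above (encoder [0, 1, 2] → [0, 2, 5], decoder [0] → [5]) are consistent with spreading the student layers evenly over a 6-layer bart-base teacher stack while always keeping the teacher's top layer. The sketch below is a hypothetical reconstruction of that rule, not the script's actual code; the function name is invented.

```python
# Hypothetical reconstruction of the logged layer map: spread student layers
# evenly over the teacher stack, always keeping the teacher's top layer.
# With a 6-layer bart-base teacher this reproduces the mappings above.
def map_student_to_teacher(n_student: int, n_teacher: int) -> list[int]:
    if n_student == 1:
        # A lone student layer can only be supervised by the top teacher layer.
        return [n_teacher - 1]
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]

print(map_student_to_teacher(3, 6))  # [0, 2, 5], as logged for the encoder
print(map_student_to_teacher(1, 6))  # [5], as logged for the decoder
```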
08/13/2023 00:15:02 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.56400845774732}
08/13/2023 00:22:51 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 65.14883955028}
08/13/2023 00:31:21 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 65.12240366547515}
08/13/2023 00:39:52 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.96621476538809}
08/13/2023 13:19:30 - WARNING - __main__ - You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with `--source_prefix 'summarize: ' `
08/13/2023 13:19:30 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/13/2023 13:19:30 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='t5-large', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/8_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='t5-large', student_model='t5-large', pred_distill=True, intermediate_distill=True, weight_bits=8, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/13/2023 13:19:57 - INFO - __main__ - ***** Running training *****
08/13/2023 13:19:57 - INFO - __main__ - Num examples = 19546
08/13/2023 13:19:57 - INFO - __main__ - Num Epochs = 10
08/13/2023 13:19:57 - INFO - __main__ - Instantaneous batch size per device = 8
08/13/2023 13:19:57 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
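The setup numbers printed for each run can be cross-checked against one another using only values from the log: 19546 examples at a per-device batch size of 8 give ceil(19546/8) = 2444 batches per epoch, and 2444 × 10 epochs = 24440, matching the logged step total. Note that this matches only if every micro-batch counts as a step, i.e. the total appears not to be divided by the gradient-accumulation factor of 2. A minimal sanity-check sketch:

```python
import math

# All inputs are taken from the logged Namespace / setup lines.
num_examples = 19546
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 10
num_processes = 1  # "Distributed environment: NO"

batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
print(batches_per_epoch)                        # 2444
print(batches_per_epoch * num_train_epochs)     # 24440, the logged step total
print(per_device_train_batch_size * num_processes
      * gradient_accumulation_steps)            # 16, the logged total batch size
```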
08/13/2023 13:19:57 - INFO - __main__ - Gradient Accumulation steps = 2
08/13/2023 13:19:57 - INFO - __main__ - Total optimization steps = 24440
08/13/2023 13:19:57 - INFO - __main__ - student encoder layers = 3
08/13/2023 13:19:57 - INFO - __main__ - student decoder layers = 1
08/13/2023 13:19:57 - INFO - __main__ - student encoder layers [0, 1, 2] is mapped with teacher encoder layers [0, 2, 5]
08/13/2023 13:19:57 - INFO - __main__ - student decoder layers [0] is mapped with teacher decoder layers [5]
08/13/2023 16:31:44 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/13/2023 16:31:44 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='facebook/bart-large', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/8_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='facebook/bart-large', student_model='facebook/bart-large', pred_distill=True, intermediate_distill=True, weight_bits=8, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/13/2023 16:32:04 - INFO - __main__ - ***** Running training *****
08/13/2023 16:32:04 - INFO - __main__ - Num examples = 19546
08/13/2023 16:32:04 - INFO - __main__ - Num Epochs = 10
08/13/2023 16:32:04 - INFO - __main__ - Instantaneous batch size per device = 8
08/13/2023 16:32:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
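The Namespace flags weight_bits=8, input_bits=8 and clip_val=2.5 point to quantization-aware distillation with clipped symmetric quantizers. The sketch below shows the standard fake-quantization step those flags suggest; it is a generic illustration under that assumption, not the script's actual quantizer.

```python
import torch

# Generic symmetric fake quantization with a fixed clipping threshold,
# matching the logged flags (bits=8, clip_val=2.5): values are clipped to
# [-clip_val, clip_val] and rounded onto a uniform 8-bit grid.
def fake_quantize(x: torch.Tensor, bits: int = 8, clip_val: float = 2.5) -> torch.Tensor:
    x = x.clamp(-clip_val, clip_val)
    scale = clip_val / (2 ** (bits - 1) - 1)  # 2.5 / 127 for 8 bits
    return torch.round(x / scale) * scale

weights = torch.randn(4, 4)
print(fake_quantize(weights))
```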
08/13/2023 16:32:04 - INFO - __main__ - Gradient Accumulation steps = 2
08/13/2023 16:32:04 - INFO - __main__ - Total optimization steps = 24440
08/13/2023 16:32:04 - INFO - __main__ - student encoder layers = 3
08/13/2023 16:32:04 - INFO - __main__ - student decoder layers = 1
08/13/2023 16:32:04 - INFO - __main__ - student encoder layers [0, 1, 2] is mapped with teacher encoder layers [0, 2, 5]
08/13/2023 16:32:04 - INFO - __main__ - student decoder layers [0] is mapped with teacher decoder layers [5]
08/13/2023 16:42:11 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 14.204618561425352}
08/13/2023 16:52:21 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 14.451229182170009}
08/13/2023 17:02:34 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 15.582597700705236}
08/13/2023 17:12:46 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 16.570258107614887}
08/13/2023 17:22:59 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 17.896398309811964}
08/13/2023 17:32:49 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 19.05405184103676}
08/13/2023 17:42:51 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 20.369355928033364}
08/13/2023 17:53:04 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 20.83913646867499}
08/13/2023 18:03:09 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 21.37904589790436}
08/13/2023 18:13:24 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 21.84488138786117}
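On the evaluation dicts: in detoxification work such as ParaDetox, 'joint' is conventionally the mean over samples of the product of style accuracy (STA), content similarity (SIM) and fluency (FL), which is why it need not equal the product of the three corpus-level averages printed here; chrF is sacrebleu's character n-gram F-score. A minimal sketch, assuming those conventions (the per-sample classifier/embedding scores and the toy sentences are stand-ins):

```python
from sacrebleu.metrics import CHRF

# Joint metric in the ParaDetox style: average the per-sample product of
# STA, SIM and FL. The per-sample scores would come from external
# classifiers/embedders, which are omitted here.
def joint_score(sta, sim, fl):
    return sum(s * m * f for s, m, f in zip(sta, sim, fl)) / len(sta)

# chrF via sacrebleu's corpus-level scorer (single reference per hypothesis).
def chrf_score(hypotheses, references):
    return CHRF().corpus_score(hypotheses, [references]).score

hyps = ["leave me alone", "that was a bad idea"]         # toy model outputs
refs = ["please leave me alone", "that was a bad idea"]  # toy references
print(chrf_score(hyps, refs))
print(joint_score([1.0, 0.9], [0.8, 0.7], [0.9, 0.95]))
```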