10/29/2022 06:15:53 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: True
10/29/2022 06:15:53 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1 distributed training: True, 16-bits training: True
10/29/2022 06:15:53 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True
10/29/2022 06:15:53 - INFO - __main__ - Training/evaluation parameters OurTrainingArguments(output_dir='out/mabel-joint-cl-al1-mlm-bs-32-lr-3e-5-msl-84-ep-3-bbl', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=, warmup_steps=0, logging_dir='runs/Oct29_06-15-53_a11-02.hpc.usc.edu', logging_first_step=False, logging_steps=500, save_steps=125, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='out/mabel-joint-cl-al1-mlm-bs-32-lr-3e-5-msl-84-ep-3-bbl', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=True, metric_for_best_model='loss', greater_is_better=False, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, eval_transfer=False, report_to='wandb')
10/29/2022 06:15:53 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1 distributed training: True, 16-bits training: True
10/29/2022 06:15:53 - WARNING - datasets.builder - Using custom data configuration default-2f6794b69ce47e79
10/29/2022 06:15:53 - WARNING - datasets.builder - Using custom data configuration default-2f6794b69ce47e79
10/29/2022 06:15:53 - WARNING - datasets.builder - Using custom data configuration default-2f6794b69ce47e79
10/29/2022 06:15:53 - WARNING - datasets.builder - Using custom data configuration default-2f6794b69ce47e79
10/29/2022 06:15:53 - WARNING - datasets.builder - Found cached dataset csv (/project/jonmay_231/jacqueline/mabel/training/.cache/csv/default-2f6794b69ce47e79/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
  0%|          | 0/1 [00:00<?, ?it/s]
>> loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at .cache/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
[INFO|configuration_utils.py:481] 2022-10-29 06:15:54,434 >> Model config BertConfig { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 0, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 30522 }
[INFO|configuration_utils.py:445] 2022-10-29 06:15:54,740 >> loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at .cache/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
[INFO|configuration_utils.py:481] 2022-10-29 06:15:54,741 >> Model config BertConfig { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 0, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 30522 }
[INFO|tokenization_utils_base.py:1766] 2022-10-29 06:15:55,339 >> loading file https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt from cache at .cache/e12f02d630da91a0982ce6db1ad595231d155a2b725ab106971898276d842ecc.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1766] 2022-10-29 06:15:55,339 >> loading file https://huggingface.co/bert-large-uncased/resolve/main/tokenizer.json from cache at .cache/475d46024228961ca8770cead39e1079f135fd2441d14cf216727ffac8d41d78.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4
[INFO|modeling_utils.py:1027] 2022-10-29 06:15:55,697 >> loading weights file https://huggingface.co/bert-large-uncased/resolve/main/pytorch_model.bin from cache at .cache/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMabel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMabel were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['lm_head.bias', 'lm_head.transform.dense.weight', 'lm_head.transform.dense.bias', 'lm_head.transform.LayerNorm.weight', 'lm_head.transform.LayerNorm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
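(Editor's note: the unused / newly-initialized weight lists above are the normal `from_pretrained` behavior when the checkpoint's pretraining heads (`cls.*`, `bert.pooler.*`) do not match the custom MABEL heads (`lm_head.*`, `mlp.*`). A minimal sketch of surfacing the same information programmatically, using the public transformers `output_loading_info` flag; `BertForMaskedLM` is a stand-in for the repo's `BertForMabel` class:)

    # Hedged sketch: inspect which checkpoint weights are dropped or freshly
    # initialized when bert-large-uncased is loaded into a model with a
    # different head. BertForMaskedLM is only a stand-in for BertForMabel.
    from transformers import BertForMaskedLM

    model, loading_info = BertForMaskedLM.from_pretrained(
        "bert-large-uncased", output_loading_info=True
    )
    print("unused checkpoint weights:", loading_info["unexpected_keys"])
    print("newly initialized weights:", loading_info["missing_keys"])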
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMabel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMabel were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['lm_head.bias', 'lm_head.transform.dense.weight', 'lm_head.transform.dense.bias', 'lm_head.transform.LayerNorm.weight', 'lm_head.transform.LayerNorm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:1135] 2022-10-29 06:16:24,169 >> Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMabel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1146] 2022-10-29 06:16:24,175 >> Some weights of BertForMabel were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['lm_head.bias', 'lm_head.transform.dense.weight', 'lm_head.transform.dense.bias', 'lm_head.transform.LayerNorm.weight', 'lm_head.transform.LayerNorm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
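(Editor's note: this warning block is repeated once per process because each of the four ranks loads the checkpoint independently; only the copy emitted through the `[WARNING|modeling_utils.py]` logger comes from the main process. The HF example scripts typically keep INFO-level logging on rank 0 only; a minimal sketch of that pattern, with `local_rank` hard-coded here where the script above reads it from its parsed training arguments:)

    # Hedged sketch: INFO logging on rank 0 only, WARNING elsewhere, so the
    # per-rank duplicates above are suppressed on ranks 1-3.
    import logging
    import transformers
    from transformers.trainer_utils import is_main_process

    local_rank = 0  # in the run above this comes from the parsed training arguments
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO if is_main_process(local_rank) else logging.WARN)
    if not is_main_process(local_rank):
        transformers.utils.logging.set_verbosity_error()  # silence transformers' own loggers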
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMabel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMabel were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['lm_head.bias', 'lm_head.transform.dense.weight', 'lm_head.transform.dense.bias', 'lm_head.transform.LayerNorm.weight', 'lm_head.transform.LayerNorm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|configuration_utils.py:445] 2022-10-29 06:16:24,467 >> loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at .cache/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
[INFO|configuration_utils.py:481] 2022-10-29 06:16:24,467 >> Model config BertConfig { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 0, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 30522 }
[INFO|modeling_utils.py:1027] 2022-10-29 06:16:24,768 >> loading weights file https://huggingface.co/bert-large-uncased/resolve/main/pytorch_model.bin from cache at .cache/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1143] 2022-10-29 06:16:52,152 >> All model checkpoint weights were used when initializing BertForPreTraining.
[WARNING|modeling_utils.py:1146] 2022-10-29 06:16:52,153 >> Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/143 [00:00<?, ?it/s]
>> The following columns in the training set don't have a corresponding argument in `BertForMabel.forward` and have been ignored: .
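(Editor's note: the "columns ... have been ignored" message is the Trainer acting on `remove_unused_columns=True` from the arguments above: any dataset column that `BertForMabel.forward` does not accept is dropped before batching. A minimal sketch of the switch using the stock `TrainingArguments`; the run above uses its own `OurTrainingArguments` subclass, and the output directory below is hypothetical:)

    # Hedged sketch: keep all dataset columns instead of silently dropping the
    # ones the model's forward() does not accept.
    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out/debug-run",       # hypothetical path, for this sketch only
        remove_unused_columns=False,      # the run above keeps the default of True
        per_device_train_batch_size=32,
    )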
[INFO|trainer.py:358] 2022-10-29 06:20:36,916 >> Using amp fp16 backend
10/29/2022 06:20:40 - INFO - trainer - ***** Running training *****
10/29/2022 06:20:40 - INFO - trainer - Num examples = 142158
10/29/2022 06:20:40 - INFO - trainer - Num Epochs = 3
10/29/2022 06:20:40 - INFO - trainer - Instantaneous batch size per device = 32
10/29/2022 06:20:40 - INFO - trainer - Total train batch size (w. parallel, distributed & accumulation) = 128
10/29/2022 06:20:40 - INFO - trainer - Gradient Accumulation steps = 1
10/29/2022 06:20:40 - INFO - trainer - Total optimization steps = 3333
  0%|          | 0/3333 [00:00<?, ?it/s]
>> Saving model checkpoint to out/mabel-joint-cl-al1-mlm-bs-32-lr-3e-5-msl-84-ep-3-bbl
[INFO|configuration_utils.py:300] 2022-10-29 07:56:01,305 >> Configuration saved in out/mabel-joint-cl-al1-mlm-bs-32-lr-3e-5-msl-84-ep-3-bbl/config.json
[INFO|modeling_utils.py:817] 2022-10-29 07:56:05,166 >> Model weights saved in out/mabel-joint-cl-al1-mlm-bs-32-lr-3e-5-msl-84-ep-3-bbl/pytorch_model.bin
10/29/2022 07:56:05 - INFO - __main__ - ***** Train results *****
10/29/2022 07:56:05 - INFO - __main__ - epoch = 3.0
10/29/2022 07:56:05 - INFO - __main__ - train_runtime = 5721.0485
10/29/2022 07:56:05 - INFO - __main__ - train_samples_per_second = 0.583
/home1/jh_445/.conda/envs/pred/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
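(Editor's note: the closing FutureWarning concerns the launcher, not the training run itself: `python -m torch.distributed.launch` still works here but is deprecated in favor of `torchrun`, which exports `LOCAL_RANK` in the environment instead of passing a `--local_rank` flag. A minimal sketch of the script-side change, with launch commands shown as comments; the exact flags and script name for this repo are assumptions:)

    # Old launcher (as used for the log above; flags assumed):
    #   python -m torch.distributed.launch --nproc_per_node=4 train.py ...
    # torchrun equivalent:
    #   torchrun --nproc_per_node=4 train.py ...
    #
    # With torchrun, read the rank from the environment rather than a --local_rank flag.
    import os

    local_rank = int(os.environ.get("LOCAL_RANK", -1))  # -1 means non-distributed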