WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1 distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1 distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True
WARNING:datasets.builder:Using custom data configuration default-2f6794b69ce47e79
WARNING:datasets.builder:Reusing dataset csv (/project/jonmay_231/jacqueline/.cache/csv/default-2f6794b69ce47e79/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)
0%|          | 0/1 [00:00<?, ?it/s]
… >> loading configuration file https://huggingface.co/roberta-large/resolve/main/config.json from cache at /project/jonmay_231/jacqueline/.cache/dea67b44b38d504f2523f3ddb6acb601b23d67bee52c942da336fa1283100990.94cae8b3a8dbab1d59b9d4827f7ce79e73124efa6bb970412cd503383a95f373
[INFO|configuration_utils.py:481] 2023-01-24 05:28:26,850 >> Model config RobertaConfig { "architectures": [ "RobertaForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "eos_token_id": 2, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 }
[INFO|configuration_utils.py:445] 2023-01-24 05:28:27,146 >> loading configuration file https://huggingface.co/roberta-large/resolve/main/config.json from cache at /project/jonmay_231/jacqueline/.cache/dea67b44b38d504f2523f3ddb6acb601b23d67bee52c942da336fa1283100990.94cae8b3a8dbab1d59b9d4827f7ce79e73124efa6bb970412cd503383a95f373
[INFO|configuration_utils.py:481] 2023-01-24 05:28:27,146 >> Model config RobertaConfig { "architectures": [ "RobertaForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "eos_token_id": 2, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 }
[INFO|tokenization_utils_base.py:1766] 2023-01-24 05:28:27,978 >> loading file https://huggingface.co/roberta-large/resolve/main/vocab.json from cache at /project/jonmay_231/jacqueline/.cache/7c1ba2435b05451bc3b4da073c8dec9630b22024a65f6c41053caccf2880eb8f.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
[INFO|tokenization_utils_base.py:1766] 2023-01-24 05:28:27,978 >> loading file https://huggingface.co/roberta-large/resolve/main/merges.txt from cache at /project/jonmay_231/jacqueline/.cache/20b5a00a80e27ae9accbe25672aba42ad2d4d4cb2c4b9359b50ca8e34e107d6d.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1766] 2023-01-24 05:28:27,979 >> loading file https://huggingface.co/roberta-large/resolve/main/tokenizer.json from cache at /project/jonmay_231/jacqueline/.cache/e16a2590deb9e6d73711d6e05bf27d832fa8c1162d807222e043ca650a556964.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
[INFO|modeling_utils.py:1027] 2023-01-24 05:28:28,614 >> loading weights file https://huggingface.co/roberta-large/resolve/main/pytorch_model.bin from cache at /project/jonmay_231/jacqueline/.cache/8e36ec2f5052bec1e79e139b84c2c3089cb647694ba0f4f634fec7b8258f7c89.c43841d8c5cd23c435408295164cda9525270aa42cd0cc9200911570c0342352
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMabel: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMabel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMabel: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMabel: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMabel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:1135] 2023-01-24 05:29:01,118 >> Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMabel: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMabel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMabel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMabel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:1146] 2023-01-24 05:29:01,123 >> Some weights of RobertaForMabel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias', 'mlp.dense1.weight', 'mlp.dense1.bias', 'mlp.dense2.weight', 'mlp.dense2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|configuration_utils.py:445] 2023-01-24 05:29:01,412 >> loading configuration file https://huggingface.co/roberta-large/resolve/main/config.json from cache at /project/jonmay_231/jacqueline/.cache/dea67b44b38d504f2523f3ddb6acb601b23d67bee52c942da336fa1283100990.94cae8b3a8dbab1d59b9d4827f7ce79e73124efa6bb970412cd503383a95f373
[INFO|configuration_utils.py:481] 2023-01-24 05:29:01,413 >> Model config RobertaConfig { "architectures": [ "RobertaForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "eos_token_id": 2, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "transformers_version": "4.2.1", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 }
[INFO|modeling_utils.py:1027] 2023-01-24 05:29:01,693 >> loading weights file https://huggingface.co/roberta-large/resolve/main/pytorch_model.bin from cache at /project/jonmay_231/jacqueline/.cache/8e36ec2f5052bec1e79e139b84c2c3089cb647694ba0f4f634fec7b8258f7c89.c43841d8c5cd23c435408295164cda9525270aa42cd0cc9200911570c0342352
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:datasets.fingerprint:Parameter 'function'=<function ….prepare_features at 0x7f6c09a9a830> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
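The "were not used" / "newly initialized" messages above are the expected outcome of loading the roberta-large checkpoint into a class that drops the pooler and adds its own head. A minimal sketch of that pattern, assuming a hypothetical class and MLP head as stand-ins for RobertaForMabel's extra modules (not the actual MABEL implementation):

    import torch.nn as nn
    from transformers import RobertaModel
    from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

    class RobertaWithProjection(RobertaPreTrainedModel):  # hypothetical stand-in for RobertaForMabel
        def __init__(self, config):
            super().__init__(config)
            # add_pooling_layer=False drops the pooler, so the checkpoint's
            # roberta.pooler.dense.* tensors have no destination -> "were not used" warning
            self.roberta = RobertaModel(config, add_pooling_layer=False)
            # extra projection head with no counterpart in the checkpoint
            # -> reported as "newly initialized" and randomly initialized
            self.mlp = nn.Sequential(
                nn.Linear(config.hidden_size, config.hidden_size),
                nn.ReLU(),
                nn.Linear(config.hidden_size, config.hidden_size),
            )
            self.init_weights()

    model = RobertaWithProjection.from_pretrained("roberta-large")

The exact lists in the warnings depend on which modules the custom class defines; the sketch only illustrates why pretrained weights can go unused while new ones start from random initialization.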
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /project/jonmay_231/jacqueline/.cache/csv/default-2f6794b69ce47e79/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-6b01a1c12a3a2107.arrow
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1143] 2023-01-24 05:29:33,813 >> All model checkpoint weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:1146] 2023-01-24 05:29:33,833 >> Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:datasets.fingerprint:Parameter 'function'=<function ….prepare_features at 0x7fe9c9381830> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.fingerprint:Parameter 'function'=<function ….prepare_features at 0x7f3a1c382830> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.fingerprint:Parameter 'function'=<function ….prepare_features at 0x7f90f406b830> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
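The datasets.fingerprint warnings above are harmless for correctness but disable caching of the .map() preprocessing; they typically appear when the function handed to Dataset.map is a closure defined inside main() that captures objects dill cannot hash. A hedged sketch of one way to keep the fingerprint stable; the function body, file name, column name, and max_len here are illustrative, not the script's actual prepare_features:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Module-level function: picklable, so datasets can hash it for the cache fingerprint.
    def prepare_features(examples, tokenizer, max_len):
        # illustrative only; the real prepare_features builds bin_mask/input_ids/attention_mask
        return tokenizer(examples["text"], truncation=True, max_length=max_len)

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    dataset = load_dataset("csv", data_files="train.csv")["train"]
    dataset = dataset.map(
        prepare_features,
        batched=True,
        fn_kwargs={"tokenizer": tokenizer, "max_len": 128},  # passed explicitly instead of captured
    )

With a hashable function and picklable fn_kwargs, repeated runs reuse the cache-*.arrow files instead of recomputing the tokenization.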
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /project/jonmay_231/jacqueline/.cache/csv/default-2f6794b69ce47e79/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-6b01a1c12a3a2107.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /project/jonmay_231/jacqueline/.cache/csv/default-2f6794b69ce47e79/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-6b01a1c12a3a2107.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /project/jonmay_231/jacqueline/.cache/csv/default-2f6794b69ce47e79/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-6b01a1c12a3a2107.arrow
Dataset({ features: ['bin_mask', 'input_ids', 'attention_mask'], num_rows: 142158 })
Dataset({ features: ['bin_mask', 'input_ids', 'attention_mask'], num_rows: 142158 })
Dataset({ features: ['bin_mask', 'input_ids', 'attention_mask'], num_rows: 142158 })
Dataset({ features: ['bin_mask', 'input_ids', 'attention_mask'], num_rows: 142158 })
[INFO|trainer.py:442] 2023-01-24 05:29:44,654 >> The following columns in the training set don't have a corresponding argument in `RobertaForMabel.forward` and have been ignored: .
[INFO|trainer.py:358] 2023-01-24 05:29:44,654 >> Using amp fp16 backend
0%|          | 0/2222 [00:00<?, ?it/s]
… >> Saving model checkpoint to /project/jonmay_231/jacqueline/mabel/training/rob-large-out/mabel-joint-cl-al1-mlm-bs-32-lr-2e-4-msl-128-ep-2-atemp-0.025
[INFO|configuration_utils.py:300] 2023-01-24 06:34:58,941 >> Configuration saved in /project/jonmay_231/jacqueline/mabel/training/rob-large-out/mabel-joint-cl-al1-mlm-bs-32-lr-2e-4-msl-128-ep-2-atemp-0.025/config.json
[INFO|modeling_utils.py:817] 2023-01-24 06:35:02,197 >> Model weights saved in /project/jonmay_231/jacqueline/mabel/training/rob-large-out/mabel-joint-cl-al1-mlm-bs-32-lr-2e-4-msl-128-ep-2-atemp-0.025/pytorch_model.bin
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
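The banner above is printed by the PyTorch distributed launcher, which defaults OMP_NUM_THREADS to 1 for each worker process. If CPU-side work (tokenization, data loading) becomes a bottleneck, the thread count can be raised explicitly; a small sketch, with 8 as an arbitrary example value rather than a recommended setting:

    import os

    # Override the launcher's default of 1 before torch initializes its thread pools.
    os.environ["OMP_NUM_THREADS"] = "8"

    import torch
    torch.set_num_threads(8)          # equivalent explicit control over intra-op CPU threads
    print(torch.get_num_threads())    # confirm the setting actually took effect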