======================== START TIME: Tue Jul 2 15:28:18 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0702 15:28:27.105000 139655573002048 torch/distributed/run.py:757]
W0702 15:28:27.105000 139655573002048 torch/distributed/run.py:757] *****************************************
W0702 15:28:27.105000 139655573002048 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0702 15:28:27.105000 139655573002048 torch/distributed/run.py:757] *****************************************
W0702 15:28:27.239000 140171220817728 torch/distributed/run.py:757]
W0702 15:28:27.239000 140171220817728 torch/distributed/run.py:757] *****************************************
W0702 15:28:27.239000 140171220817728 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0702 15:28:27.239000 140171220817728 torch/distributed/run.py:757] *****************************************
[default0]:07/02/2024 15:28:50 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Vocab Size Padding] Padded vocab (size: 50257) with 7 dummy tokens (new size: 50264)
[default0]:07/02/2024 15:28:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Config:
Config(general=GeneralArgs(project='bench_cluster',
                           run='%date_%jobid',
                           seed=42,
                           step=None,
                           consumed_train_samples=None,
                           benchmark_csv_path=None,
                           ignore_sanity_checks=True),
       parallelism=ParallelismArgs(dp=1,
                                   pp=2,
                                   tp=8,
                                   pp_engine=,
                                   tp_mode=,
                                   tp_linear_async_communication=False,
                                   expert_parallel_size=1),
       model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
                                                eos_token_id=2,
                                                hidden_act='silu',
                                                hidden_size=2048,
                                                initializer_range=0.02,
                                                intermediate_size=4096,
                                                is_llama_config=True,
                                                max_position_embeddings=4096,
                                                num_attention_heads=32,
                                                num_hidden_layers=24,
                                                num_key_value_heads=32,
                                                pad_token_id=None,
                                                pretraining_tp=1,
                                                rms_norm_eps=1e-05,
                                                rope_scaling=None,
                                                rope_theta=10000.0,
                                                tie_word_embeddings=True,
                                                use_cache=True,
                                                vocab_size=50264),
                       init_method=RandomInit(std=0.025),
                       dtype=torch.bfloat16,
                       make_vocab_size_divisible_by=1,
                       ddp_bucket_cap_mb=25),
       tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
                               tokenizer_revision=None,
                               tokenizer_max_length=None),
       checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
                                   checkpoint_interval=100000,
                                   save_initial_state=False,
                                   resume_checkpoint_path=None,
                                   checkpoints_path_is_shared_file_system=False),
       logging=LoggingArgs(log_level='info',
                           log_level_replica='info',
                           iteration_step_info_interval=1),
       tokens=TokensArgs(sequence_length=4096,
                         train_steps=20,
                         micro_batch_size=2,
                         batch_accumulation_per_replica=512,
                         val_check_interval=-1,
                         limit_val_batches=0,
                         limit_test_batches=0),
       optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
                                                                    adam_beta1=0.9,
                                                                    adam_beta2=0.95,
                                                                    torch_adam_is_fused=True,
                                                                    name='adamW'),
                               zero_stage=1,
                               weight_decay=0.01,
                               clip_grad=1.0,
                               accumulate_grad_in_fp32=True,
                               learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
                                                                       lr_warmup_steps=1,
                                                                       lr_warmup_style='linear',
                                                                       lr_decay_style='linear',
                                                                       lr_decay_steps=19,
                                                                       lr_decay_starting_step=None,
                                                                       min_decay_lr=1e-05)),
       data_stages=[DatasetStageArgs(name='Training Stage',
                                     start_training_step=1,
                                     data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
                                                                                hf_dataset_splits='train',
                                                                                hf_dataset_config_name=None,
                                                                                dataset_processing_num_proc_per_process=64,
                                                                                dataset_overwrite_cache=False,
                                                                                text_column_name='text'),
                                                   seed=42,
                                                   num_loading_workers=32))],
       profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/16_GPUS/dp-1_tp-8_pp-2_mbz-2')),
       lighteval=None)
[default0]:07/02/2024 15:28:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Model Config:
LlamaConfig(bos_token_id=1,
            eos_token_id=2,
            hidden_act='silu',
            hidden_size=2048,
            initializer_range=0.02,
            intermediate_size=4096,
            is_llama_config=True,
            max_position_embeddings=4096,
            num_attention_heads=32,
            num_hidden_layers=24,
            num_key_value_heads=32,
            pad_token_id=None,
            pretraining_tp=1,
            rms_norm_eps=1e-05,
            rope_scaling=None,
            rope_theta=10000.0,
            tie_word_embeddings=True,
            use_cache=True,
            vocab_size=50264)
[default0]:07/02/2024 15:28:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Building model..
[default0]:07/02/2024 15:28:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Setting PP block ranks...
[default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=4|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB)
[default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=4|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB
[default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=4|ip-26-0-163-43]: No checkpoint path provided.
[default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=7|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB)
[default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=7|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB
[default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=7|ip-26-0-163-43]: No checkpoint path provided.
[default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB)
[default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB
[default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-43]: No checkpoint path provided.
[default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB)
[default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB
[default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-43]: No checkpoint path provided.
[default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB)
[default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB
[default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-43]: No checkpoint path provided.
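For reference, the parallel topology and effective batch size implied by the configuration above can be checked with a few lines of arithmetic. The snippet below is an editorial annotation, not part of the run; every number is copied from this log (ParallelismArgs and TokensArgs), and the resulting 16-GPU world size matches the 16_GPUS results path in the profiler export.

# Worked arithmetic from the Config dump above (all values copied from this log).
dp, pp, tp = 1, 2, 8                    # ParallelismArgs
micro_batch_size = 2                    # TokensArgs.micro_batch_size
grad_accum = 512                        # TokensArgs.batch_accumulation_per_replica
sequence_length = 4096                  # TokensArgs.sequence_length

world_size = dp * pp * tp                                 # 16 GPUs
global_batch_size = dp * micro_batch_size * grad_accum    # 1 * 2 * 512 = 1024 samples per step
tokens_per_step = global_batch_size * sequence_length     # 1024 * 4096 = 4_194_304, the 4.19M consumed_tokens per iteration

assert (world_size, global_batch_size, tokens_per_step) == (16, 1024, 4_194_304)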
[default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Total number of parameters: 1.21G (2314.22MiB) [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB) [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: No checkpoint path provided. [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Parametrizing model parameters using StandardParametrizator [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=5|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB) [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=5|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=5|ip-26-0-163-43]: No checkpoint path provided. [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=6|ip-26-0-163-43]: Local number of parameters: 86.3M (164.65MiB) [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=6|ip-26-0-163-43]: [After model building] Memory usage: 179.67MiB. Peak allocated: 181.70MiB Peak reserved: 198.00MiB [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=0|TP=6|ip-26-0-163-43]: No checkpoint path provided. [default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=4|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=7|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=7|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=4|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default3]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-207]: No checkpoint path provided. [default4]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=4|ip-26-0-169-207]: No checkpoint path provided. [default7]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=7|ip-26-0-169-207]: No checkpoint path provided. [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=5|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=6|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=6|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default6]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=6|ip-26-0-169-207]: No checkpoint path provided. [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=5|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default5]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=5|ip-26-0-169-207]: No checkpoint path provided. [default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. 
Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default1]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-207]: No checkpoint path provided. [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-207]: Local number of parameters: 65.3M (124.62MiB) [default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-207]: [After model building] Memory usage: 135.64MiB. Peak allocated: 137.67MiB Peak reserved: 150.00MiB [default0]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default2]:07/02/2024 15:29:07 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-207]: No checkpoint path provided. [default0]:07/02/2024 15:29:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/02/2024 15:29:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/02/2024 15:29:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [ZeRO sharding] DP Rank 0 has 86.3M out of 86.3M (100.00%) params' optimizer states [default0]:07/02/2024 15:29:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/02/2024 15:29:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Using `datasets` library [default0]:07/02/2024 15:29:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/02/2024 15:29:10 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Training Plan] There are 1 training stages [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Stage Training Stage] start from step 1 [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: [Start training] datetime: 2024-07-02 15:29:12.643268 | mbs: 2 | grad_accum: 512 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/02/2024 15:29:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 839.67MiB. Peak allocated 839.67MiB. Peak reserved: 858.00MiB [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=2|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=4|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=1|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=3|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. 
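The per-rank parameter counts reported above are mutually consistent: the first pipeline stage (ip-26-0-163-43) holds 86.3M parameters per tensor-parallel rank, the second (ip-26-0-169-207) holds 65.3M, each split over tp=8 ranks, and with dp=1 the ZeRO-1 shard on DP rank 0 is simply the full local parameter set. A small sanity check, again an editorial annotation using only numbers printed in this log:

# Consistency check of the parameter counts logged above (numbers from this log).
tp = 8
pp0_params_per_rank = 86.3e6   # "Local number of parameters: 86.3M" on the PP=0 node
pp1_params_per_rank = 65.3e6   # "Local number of parameters: 65.3M" on the PP=1 node

total_params = tp * (pp0_params_per_rank + pp1_params_per_rank)
print(f"{total_params / 1e9:.2f}G")   # ~1.21G, matching "Total number of parameters: 1.21G (2314.22MiB)"

# ZeRO stage 1 shards optimizer states across data-parallel ranks; with dp=1 there is only
# one shard, hence "DP Rank 0 has 86.3M out of 86.3M (100.00%) params' optimizer states".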
[default6]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=6|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=5|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=4|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=7|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=3|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=1|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=6|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=2|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 15:29:12 [WARNING|DP=0|PP=0|TP=7|ip-26-0-163-43]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 15:29:12 [WARNING|DP=0|PP=1|TP=5|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 15:29:13 [WARNING|DP=0|PP=1|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). 
If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. 
This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). 
If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default1]: warnings.warn( [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default5]: warnings.warn( [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default7]: warnings.warn( [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default3]: warnings.warn( [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default4]: warnings.warn( [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default5]: warnings.warn( [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default6]: warnings.warn( [default2]: warnings.warn( [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. 
If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default3]: warnings.warn( [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default7]: warnings.warn( [default4]: warnings.warn( [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default6]: warnings.warn( [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default0]: warnings.warn( [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default1]: warnings.warn( [default0]:07/02/2024 15:30:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 910.74MiB. Peak allocated 4811.24MiB. Peak reserved: 4932.00MiB [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default2]: warnings.warn( [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2261: UserWarning: torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions [default0]: warnings.warn( [default0]:07/02/2024 15:30:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 1570.39MiB. 
Peak reserved: 4932.00MiB
[default0]:07/02/2024 15:30:26 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 1 / 20 | consumed_tokens: 4.19M | elapsed_time_per_iteration_ms: 71.7K | tokens_per_sec: 58.5K | tokens_per_sec_per_gpu: 3.66K | global_batch_size: 1.02K | lm_loss: 11.2 | lr: 0.0001 | model_tflops_per_gpu: 33.2 | hardware_tflops_per_gpu: 33.2 | grad_norm: 12.1 | cuda_memory_allocated: 1.26G | cuda_max_memory_reserved: 2.94G | hd_total_memory_tb: 312G | hd_used_memory_tb: 65.5G | hd_free_memory_tb: 247G
[default0]:07/02/2024 15:31:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 5316.30MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:31:14 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 2 / 20 | consumed_tokens: 8.39M | elapsed_time_per_iteration_ms: 47.8K | tokens_per_sec: 87.8K | tokens_per_sec_per_gpu: 5.49K | global_batch_size: 1.02K | lm_loss: 11.2 | lr: 9.53e-05 | model_tflops_per_gpu: 49.8 | hardware_tflops_per_gpu: 49.8 | grad_norm: 12.2 | cuda_memory_allocated: 1.26G | cuda_max_memory_reserved: 3.36G | hd_total_memory_tb: 312G | hd_used_memory_tb: 65.5G | hd_free_memory_tb: 247G
[default0]:07/02/2024 15:31:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 1570.43MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:32:05 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 3 / 20 | consumed_tokens: 12.6M | elapsed_time_per_iteration_ms: 50.7K | tokens_per_sec: 82.7K | tokens_per_sec_per_gpu: 5.17K | global_batch_size: 1.02K | lm_loss: 10 | lr: 9.05e-05 | model_tflops_per_gpu: 46.9 | hardware_tflops_per_gpu: 46.9 | grad_norm: 51.6 | cuda_memory_allocated: 1.26G | cuda_max_memory_reserved: 3.36G | hd_total_memory_tb: 312G | hd_used_memory_tb: 65.5G | hd_free_memory_tb: 247G
[default0]:07/02/2024 15:32:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 5316.30MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:32:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 1570.43MiB. Peak reserved: 5404.00MiB
[default0]:STAGE:2024-07-02 15:32:05 645199:645199 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[default0]:07/02/2024 15:33:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 5316.30MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:33:13 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 4 / 20 | consumed_tokens: 16.8M | elapsed_time_per_iteration_ms: 68.1K | tokens_per_sec: 61.6K | tokens_per_sec_per_gpu: 3.85K | global_batch_size: 1.02K | lm_loss: 11.7 | lr: 8.58e-05 | model_tflops_per_gpu: 34.9 | hardware_tflops_per_gpu: 34.9 | grad_norm: 18.3 | cuda_memory_allocated: 1.26G | cuda_max_memory_reserved: 3.36G | hd_total_memory_tb: 312G | hd_used_memory_tb: 65.5G | hd_free_memory_tb: 247G
[default0]:07/02/2024 15:33:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 1570.43MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:34:23 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 5 / 20 | consumed_tokens: 21M | elapsed_time_per_iteration_ms: 69.7K | tokens_per_sec: 60.2K | tokens_per_sec_per_gpu: 3.76K | global_batch_size: 1.02K | lm_loss: 10.4 | lr: 8.11e-05 | model_tflops_per_gpu: 34.1 | hardware_tflops_per_gpu: 34.1 | grad_norm: 16
[default0]:07/02/2024 15:34:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-43]: Memory usage: 1570.39MiB. Peak allocated 5316.30MiB. Peak reserved: 5404.00MiB
[default0]:07/02/2024 15:35:32 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-207]: iteration: 6 / 20 | consumed_tokens: 25.2M | elapsed_time_per_iteration_ms: 69.1K | tokens_per_sec: 60.7K | tokens_per_sec_per_gpu: 3.79K | global_batch_size: 1.02K | lm_loss: 9.9 | lr: 7.63e-05 | model_tflops_per_gpu: 34.4 | hardware_tflops_per_gpu: 34.4 | grad_norm: 9.07
[default0]:STAGE:2024-07-02 15:38:37 645199:645199 ActivityProfilerController.cpp:320] Completed Stage: Collection
[default0]:STAGE:2024-07-02 15:38:58 645199:645199 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[default2]:[rank10]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
[default3]:[rank11]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600035 milliseconds before timing out.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default0]:[rank8]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
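As a quick consistency check on the per-iteration metrics above (an editorial annotation, not part of the run): each iteration processes the full global batch, so tokens_per_sec follows from the 4.19M tokens per step and the logged iteration time. Using iteration 2's values:

# Cross-check of the logged throughput for iteration 2 (all inputs copied from this log).
tokens_per_step = 1024 * 4096        # global_batch_size * sequence_length = 4_194_304 tokens
elapsed_s = 47.8                     # elapsed_time_per_iteration_ms: 47.8K
num_gpus = 16                        # dp * pp * tp = 1 * 2 * 8

tokens_per_sec = tokens_per_step / elapsed_s           # ~87.7K; the log reports 87.8K (its elapsed time is itself rounded)
tokens_per_sec_per_gpu = tokens_per_sec / num_gpus     # ~5.48K; the log reports 5.49K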
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = 
encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: recv_activation_tensor = recv_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank9]: dist.recv( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
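The traceback above shows rank 9, on the PP=1 node, blocked in a pipeline-parallel recv of tensor metadata when its NCCL communicator was aborted; the watchdog entries earlier report Timeout(ms)=600000, i.e. the collective waited out a 10-minute NCCL timeout before being killed. As an illustration only (nanotron and torchrun set up the process groups in this run, not this snippet), the limit in question is the standard `timeout` argument of `torch.distributed.init_process_group`:

# Illustration only: the torch.distributed knob behind the "Timeout(ms)=600000" seen above.
# Hypothetical usage, not code from this run; a larger value would let slow collectives
# wait longer before the watchdog aborts the communicator.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),   # example value, above the 10-minute limit hit in this log
)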
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c21616897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9c228efc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9c228f4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9c228f5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd3e95 (0x7f9c6e38ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7f9c733d5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7f9c731a0353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:terminate called after throwing an instance of 'c10::DistBackendError'
[default1]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c21616897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9c228efc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9c228f4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9c228f5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd3e95 (0x7f9c6e38ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7f9c733d5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7f9c731a0353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c21616897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: <unknown function> + 0xe32119 (0x7f9c22579119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: <unknown function> + 0xd3e95 (0x7f9c6e38ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #3: <unknown function> + 0x8609 (0x7f9c733d5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #4: clone + 0x43 (0x7f9c731a0353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default2]:[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657.
[default2]:[rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default2]:[rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[default2]:[rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
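All eight ranks of this pipeline stage abort with the same Timeout(ms)=600000 watchdog error, i.e. the blocking receive exhausted the 10-minute limit attached to the process group. A minimal sketch of where that limit lives in plain PyTorch follows; it is not nanotron's actual initialization code, and it assumes the usual torchrun-provided environment (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).

from datetime import timedelta

import torch.distributed as dist

if __name__ == "__main__":
    # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT via the environment.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=30),  # the failed run hit the 600000 ms (10 min) limit seen above
    )
    # ... training loop ...
    dist.destroy_process_group()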
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f88de3d4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f88df6adc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f88df6b2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f88df6b3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f892b14ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f8930193609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f892ff5e353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f88de3d4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f88df337119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f892b14ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f8930193609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f892ff5e353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pipeline_state.run_communication() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank11]: recv_activation_tensor = recv_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank11]: dist.recv( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default3]:[rank11]: return func(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank11]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600014 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff9487b5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff949a8ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff949a93a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff949a94dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff99552de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff99a574609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7ff99a33f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600014 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff9487b5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff949a8ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff949a93a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff949a94dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff99552de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff99a574609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7ff99a33f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff9487b5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7ff949718119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7ff99552de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7ff99a574609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7ff99a33f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank15]: recv_activation_tensor = recv_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank15]: dist.recv( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default7]:[rank15]: return func(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank15]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84bd52d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84be806c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84be80ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84be80cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f850a2a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f850f2ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f850f0b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84bd52d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84be806c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84be80ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84be80cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f850a2a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f850f2ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f850f0b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84bd52d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f84be490119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f850a2a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f850f2ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f850f0b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a5b553897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a5c82cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a5c831a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a5c832dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f5aa82cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f5aad312609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f5aad0dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a5b553897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a5c82cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a5c831a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a5c832dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f5aa82cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f5aad312609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f5aad0dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a5b553897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f5a5c4b6119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f5aa82cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 
(0x7f5aad312609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f5aad0dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efdea5b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efdeb889c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efdeb88ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efdeb88fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7efe37328e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7efe3c36f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7efe3c13a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efdea5b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efdeb889c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efdeb88ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efdeb88fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7efe37328e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7efe3c36f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7efe3c13a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efdea5b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7efdeb513119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7efe37328e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7efe3c36f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7efe3c13a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2bb974897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd2bcc4dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd2bcc52a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd2bcc53dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fd3086ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fd30d733609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fd30d4fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2bb974897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd2bcc4dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd2bcc52a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd2bcc53dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fd3086ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fd30d733609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fd30d4fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2bb974897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7fd2bc8d7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7fd3086ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 
(0x7fd30d733609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7fd30d4fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 27658, last enqueued NCCL work: 27658, last completed NCCL work: 27657. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f268b303897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f268c5dcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f268c5e1a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f268c5e2dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f26d807be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f26dd0c2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f26dce8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27658, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f268b303897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f268c5dcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f268c5e1a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f268c5e2dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f26d807be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f26dd0c2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f26dce8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f268b303897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f268c266119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f26d807be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f26dd0c2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f26dce8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: E0702 15:45:39.534000 140171220817728 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 2214511) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED 
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 9 (local_rank: 1)
  exitcode : -6 (pid: 2214512)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214512
[2]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 10 (local_rank: 2)
  exitcode : -6 (pid: 2214513)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214513
[3]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 11 (local_rank: 3)
  exitcode : -6 (pid: 2214514)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214514
[4]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 12 (local_rank: 4)
  exitcode : -6 (pid: 2214515)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214515
[5]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 13 (local_rank: 5)
  exitcode : -6 (pid: 2214516)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214516
[6]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 14 (local_rank: 6)
  exitcode : -6 (pid: 2214517)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214517
[7]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 15 (local_rank: 7)
  exitcode : -6 (pid: 2214518)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214518
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_15:45:39
  host : ip-26-0-169-207.ec2.internal
  rank : 8 (local_rank: 0)
  exitcode : -6 (pid: 2214511)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2214511
============================================================
srun: error: ip-26-0-169-207: task 1: Exited with exit code 1
[default4]:[rank4]: Traceback (most recent call last):
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in
[default4]:[rank4]: trainer.train(dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter(
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter
[default4]:[rank4]: send_activation()
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__
[default4]:[rank4]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default4]:[rank4]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors
[default4]:[rank4]: dist.isend(
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend
[default4]:[rank4]: return pg.send([tensor], dst, tag)
[default4]:[rank4]:
RuntimeError: Unconvertible NCCL type [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default1]:[rank1]: send_activation() [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank1]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank1]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default1]:[rank1]: dist.isend( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default1]:[rank1]: return pg.send([tensor], dst, tag) [default1]:[rank1]: RuntimeError: Unconvertible NCCL type [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default5]:[rank5]: send_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank5]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank5]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default5]:[rank5]: dist.isend( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default5]:[rank5]: return pg.send([tensor], dst, tag) 
[default5]:[rank5]: RuntimeError: Unconvertible NCCL type [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default7]:[rank7]: send_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank7]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank7]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default7]:[rank7]: dist.isend( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default7]:[rank7]: return pg.send([tensor], dst, tag) [default7]:[rank7]: RuntimeError: Unconvertible NCCL type [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default2]:[rank2]: send_activation() [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank2]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank2]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default2]:[rank2]: dist.isend( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default2]:[rank2]: return pg.send([tensor], 
dst, tag) [default2]:[rank2]: RuntimeError: Unconvertible NCCL type [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default6]:[rank6]: send_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default6]:[rank6]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank6]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default6]:[rank6]: dist.isend( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default6]:[rank6]: return pg.send([tensor], dst, tag) [default6]:[rank6]: RuntimeError: Unconvertible NCCL type [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 258, in train_batch_iter [default3]:[rank3]: send_activation() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank3]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank3]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 306, in isend_tensors [default3]:[rank3]: dist.isend( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1809, in isend [default3]:[rank3]: return 
pg.send([tensor], dst, tag) [default3]:[rank3]: RuntimeError: Unconvertible NCCL type [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 6] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f062e4fe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f062f7d7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f062f7dca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f062f7dddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f067b276e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f06802bd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f0680088353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 2 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. 
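The per-rank tracebacks above all end in `RuntimeError: Unconvertible NCCL type`, raised from `dist.isend` inside nanotron's `p2p.isend_tensors`; the NCCL backend emits this message when a tensor's dtype cannot be mapped to an NCCL datatype. A hedged sketch of the same point-to-point call with a plain CUDA tensor follows (the exact tensor that triggered the error is not visible in this log):

```python
# Hedged sketch, not nanotron's p2p.py: dist.isend on a contiguous CUDA tensor in a
# standard floating-point dtype (e.g. bf16) maps cleanly to an NCCL datatype;
# "Unconvertible NCCL type" means the dtype of the tensor being sent has no NCCL
# equivalent. The failing tensor itself is not shown in the log above.
import torch
import torch.distributed as dist

def isend_activation(activation: torch.Tensor, to_rank: int, tag: int = 0):
    assert activation.is_cuda, "NCCL point-to-point ops require CUDA tensors"
    work = dist.isend(activation.contiguous(), dst=to_rank, tag=tag)
    return work  # caller should call work.wait() before reusing or freeing the buffer
```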
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f062e4fe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f062f7d7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f062f7dca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f062f7dddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f067b276e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f06802bd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f0680088353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f062e4fe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f062f461119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f067b276e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f06802bd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f0680088353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 5] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5c742a8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5c75581c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5c75586a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5c75587dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f5cc1020e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f5cc6067609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f5cc5e32353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 2 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5c742a8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5c75581c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5c75586a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5c75587dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f5cc1020e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f5cc6067609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f5cc5e32353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5c742a8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f5c7520b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f5cc1020e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default5]:frame #3: + 0x8609 (0x7f5cc6067609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f5cc5e32353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 3] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7af072d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7af1a06c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7af1a0ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7af1a0cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f7b3d4a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f7b424ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f7b422b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7af072d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7af1a06c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7af1a0ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7af1a0cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f7b3d4a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f7b424ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f7b422b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7af072d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f7af1690119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f7b3d4a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f7b424ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f7b422b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 4] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb352527897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb353800c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb353805a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb353806dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fb39f29fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fb3a42e6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fb3a40b1353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 2 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb352527897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb353800c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb353805a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb353806dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fb39f29fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fb3a42e6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fb3a40b1353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb352527897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7fb35348a119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7fb39f29fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default4]:frame #3: + 0x8609 (0x7fb3a42e6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7fb3a40b1353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 7] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2320a2c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2321d05c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2321d0aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2321d0bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f236d7a4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f23727eb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f23725b6353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 2 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2320a2c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2321d05c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2321d0aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2321d0bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f236d7a4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f23727eb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f23725b6353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2320a2c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f232198f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f236d7a4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f23727eb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f23725b6353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 2] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd457045897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd45831ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd458323a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd458324dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fd4a3dbde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fd4a8e04609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fd4a8bcf353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd457045897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd45831ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd458323a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd458324dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fd4a3dbde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fd4a8e04609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fd4a8bcf353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd457045897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fd457fa8119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fd4a3dbde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default2]:frame #3: + 0x8609 (0x7fd4a8e04609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fd4a8bcf353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 1] Timeout at NCCL work: 350245, last enqueued NCCL work: 350301, last completed NCCL work: 350244. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89ea810897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f89ebae9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f89ebaeea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f89ebaefdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f8a37588e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f8a3c5cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f8a3c39a353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=350245, OpType=_REDUCE_SCATTER_BASE, NumelIn=16777216, NumelOut=2097152, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. 
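The second wave of timeouts, on PG 2, is a `_REDUCE_SCATTER_BASE` collective with NumelIn=16777216 and NumelOut=2097152: each rank receives the input split across an 8-way group (16777216 / 8 = 2097152). As a hedged illustration rather than the call site used in this run, the public equivalent of that op is `reduce_scatter_tensor`:

```python
# Hedged illustration: _REDUCE_SCATTER_BASE corresponds to the public
# reduce_scatter_tensor API, where each rank receives input.numel() / world_size
# elements -- consistent with NumelIn=16777216 and NumelOut=2097152 above.
import torch
import torch.distributed as dist

def reduce_scatter_flat(flat: torch.Tensor, group=None) -> torch.Tensor:
    world_size = dist.get_world_size(group)
    assert flat.numel() % world_size == 0
    out = torch.empty(flat.numel() // world_size, dtype=flat.dtype, device=flat.device)
    dist.reduce_scatter_tensor(out, flat, op=dist.ReduceOp.SUM, group=group)
    return out
```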
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89ea810897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f89ebae9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f89ebaeea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f89ebaefdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd3e95 (0x7f8a37588e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7f8a3c5cf609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7f8a3c39a353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89ea810897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: <unknown function> + 0xe32119 (0x7f89eb773119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: <unknown function> + 0xd3e95 (0x7f8a37588e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #3: <unknown function> + 0x8609 (0x7f8a3c5cf609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #4: clone + 0x43 (0x7f8a3c39a353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
W0702 15:45:49.409000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645199 closing signal SIGTERM
W0702 15:45:49.412000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645200 closing signal SIGTERM
W0702 15:45:49.412000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645201 closing signal SIGTERM
W0702 15:45:49.414000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645202 closing signal SIGTERM
W0702 15:45:49.414000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645203 closing signal SIGTERM
W0702 15:45:49.414000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 645206 closing signal SIGTERM
E0702 15:45:57.084000 139655573002048 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 5 (pid: 645204) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-02_15:45:49
  host      : ip-26-0-163-43.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 645205)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 645205
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-02_15:45:49
  host      : ip-26-0-163-43.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 645204)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 645204
============================================================
srun: error: ip-26-0-163-43: task 0: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.