======================== START TIME: Wed Jul 3 16:20:56 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0703 16:20:58.767000 140462433310528 torch/distributed/run.py:757] W0703 16:20:58.767000 140462433310528 torch/distributed/run.py:757] ***************************************** W0703 16:20:58.767000 140462433310528 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 16:20:58.767000 140462433310528 torch/distributed/run.py:757] ***************************************** W0703 16:20:58.769000 140247216203584 torch/distributed/run.py:757] W0703 16:20:58.769000 140247216203584 torch/distributed/run.py:757] ***************************************** W0703 16:20:58.769000 140247216203584 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 16:20:58.769000 140247216203584 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config: [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: run='%date_%jobid', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: step=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: consumed_train_samples=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: benchmark_csv_path=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ignore_sanity_checks=True), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: parallelism=ParallelismArgs(dp=2, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp=8, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp_engine=, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_mode=, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_linear_async_communication=False, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: expert_parallel_size=1), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50257), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dtype=torch.bfloat16, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_revision=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_max_length=None), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoint_interval=100000, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: save_initial_state=False, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: resume_checkpoint_path=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: log_level_replica='info', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: iteration_step_info_interval=1), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: train_steps=20, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: micro_batch_size=64, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: batch_accumulation_per_replica=8, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: val_check_interval=-1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_val_batches=0, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_test_batches=0), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta1=0.9, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta2=0.95, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: torch_adam_is_fused=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: name='adamW'), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: zero_stage=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: weight_decay=0.01, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: clip_grad=1.0, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_steps=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_style='linear', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_style='linear', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_steps=19, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_starting_step=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: min_decay_lr=1e-05)), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: start_training_step=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_splits='train', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_config_name=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_overwrite_cache=False, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: text_column_name='text'), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_loading_workers=0))], [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/tmp/remove/llama-1B/16_GPUS/dp-2_tp-1_pp-8_mbz-64')), [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lighteval=None) [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Model Config: [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50257) [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Building model.. [default0]:07/03/2024 16:21:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Setting PP block ranks... [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: Local number of parameters: 126M (240.02MiB) [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-134]: Local number of parameters: 103M (196.32MiB) [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 196.33MiB. Peak allocated: 196.34MiB Peak reserved: 200.00MiB [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-134]: Local number of parameters: 168M (320.03MiB) [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 324.04MiB. Peak allocated: 326.07MiB Peak reserved: 336.00MiB [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: Local number of parameters: 126M (240.02MiB) [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=1|TP=0|ip-26-0-160-225]: Local number of parameters: 126M (240.02MiB) [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=1|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default2]:07/03/2024 16:21:29 [INFO|DP=0|PP=1|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=3|TP=0|ip-26-0-160-225]: Local number of parameters: 168M (320.03MiB) [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=3|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 324.04MiB. Peak allocated: 326.07MiB Peak reserved: 336.00MiB [default6]:07/03/2024 16:21:29 [INFO|DP=0|PP=3|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=2|TP=0|ip-26-0-160-225]: Local number of parameters: 126M (240.02MiB) [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=2|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Total number of parameters: 1.21G (2312.82MiB) [default4]:07/03/2024 16:21:29 [INFO|DP=0|PP=2|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Local number of parameters: 271M (516.35MiB) [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 520.36MiB. Peak allocated: 522.39MiB Peak reserved: 534.00MiB [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default0]:07/03/2024 16:21:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Parametrizing model parameters using StandardParametrizator [default3]:07/03/2024 16:21:30 [INFO|DP=1|PP=5|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default1]:07/03/2024 16:21:30 [INFO|DP=1|PP=4|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default5]:07/03/2024 16:21:30 [INFO|DP=1|PP=6|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default7]:07/03/2024 16:21:30 [INFO|DP=1|PP=7|TP=0|ip-26-0-163-134]: No checkpoint path provided. [default5]:07/03/2024 16:21:30 [INFO|DP=1|PP=2|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default1]:07/03/2024 16:21:30 [INFO|DP=1|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default3]:07/03/2024 16:21:30 [INFO|DP=1|PP=1|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default7]:07/03/2024 16:21:30 [INFO|DP=1|PP=3|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 0 has 135M out of 271M (50.00%) params' optimizer states [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 1 has 135M out of 271M (50.00%) params' optimizer states [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Using `datasets` library [default0]:07/03/2024 16:21:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/03/2024 16:21:34 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] There are 1 training stages [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Start training] datetime: 2024-07-03 16:21:34.818043 | mbs: 64 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 16:21:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Memory usage: 2069.40MiB. Peak allocated 2069.40MiB. Peak reserved: 2086.00MiB [default2]:07/03/2024 16:21:34 [WARNING|DP=0|PP=1|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 16:21:34 [WARNING|DP=1|PP=1|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 16:21:35 [WARNING|DP=0|PP=5|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 16:21:35 [WARNING|DP=1|PP=6|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 16:21:34 [WARNING|DP=0|PP=4|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 16:21:34 [WARNING|DP=0|PP=7|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 16:21:35 [WARNING|DP=0|PP=6|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 16:21:34 [WARNING|DP=1|PP=7|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 16:21:34 [WARNING|DP=0|PP=2|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 16:21:34 [WARNING|DP=0|PP=3|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 16:21:34 [WARNING|DP=1|PP=3|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 16:21:35 [WARNING|DP=1|PP=4|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 16:21:35 [WARNING|DP=1|PP=5|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 16:21:35 [WARNING|DP=1|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 16:21:35 [WARNING|DP=1|PP=2|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default0]:[rank0]: merged_states = self.gate_up_proj(hidden_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default0]:[rank0]: return column_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default0]:[rank0]: return F.linear(input, weight, bias) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 565, in forward [default1]:[rank1]: key_value_states = key_value_states.permute(1, 2, 0, 3, 4).contiguous() [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.76 GiB is free. Including non-PyTorch memory, this process has 77.55 GiB memory in use. Of the allocated memory 69.16 GiB is allocated by PyTorch, and 1.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank8]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank8]: pipeline_state.run_communication() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank8]: recv_activation_tensor = recv_activation() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank8]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank8]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank8]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank8]: dist.recv( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank8]: return func(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank8]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank8]: torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default0]:[rank8]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank8]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc1ff5ec897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank8]: frame #1: + 0x5b3a23e (0x7fc23910923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc239103c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc239103f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc239104fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc2390b9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc2390b9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc2390b9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc2390b9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc2008c6189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc2008cd610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc2008ec978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #12: + 0x5adc309 (0x7fc2390ab309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #13: + 0x5ae6f10 (0x7fc2390b5f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #14: + 0x5ae6fa5 (0x7fc2390b5fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #15: + 0x5124446 (0x7fc2386f3446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #16: + 0x1acf4b8 (0x7fc23509e4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #17: + 0x5aee004 (0x7fc2390bd004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #18: + 0x5af36b5 (0x7fc2390c26b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #19: + 0xd2631e (0x7fc24bcac31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: frame #20: + 0x47def4 (0x7fc24b403ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: frame #21: + 0x1445a6 (0x5594100c75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5594100c0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #23: + 0x150866 (0x5594100d3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5594100bc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5594100c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #26: PyObject_Call + 0xbc (0x5594100d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5594100ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5594100c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #30: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #32: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #34: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5594100bff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5594100d1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #38: + 0x211239 (0x559410194239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5594100c0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5594100bc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5594100c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5594100b7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5594100c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #45: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #46: PyObject_Call + 0xbc (0x5594100d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5594100ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #48: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #49: PyObject_Call + 0xbc (0x5594100d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5594100ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5594100c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5594100c0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5594100d1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #54: + 0x211239 (0x559410194239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #55: PyObject_Call + 0x207 (0x5594100d4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5594100ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #57: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5594100b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #59: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #60: PyObject_Call + 0xbc (0x5594100d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5594100ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #62: + 0x150582 (0x5594100d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #63: PyObject_Call + 0xbc (0x5594100d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default6]:[rank14]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank14]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank14]: pipeline_state.run_communication() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank14]: recv_activation_tensor = recv_activation() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank14]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank14]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank14]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank14]: dist.recv( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank14]: return func(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank14]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank14]: torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default6]:[rank14]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank14]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7862bf7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank14]: frame #1: + 0x5b3a23e (0x7f789c71423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f789c70ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f789c70ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f789c70ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789c6c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789c6c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789c6c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789c6c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7863ed1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7863ed8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7863ef7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #12: + 0x5adc309 (0x7f789c6b6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #13: + 0x5ae6f10 (0x7f789c6c0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #14: + 0x5ae6fa5 (0x7f789c6c0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #15: + 0x5124446 (0x7f789bcfe446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #16: + 0x1acf4b8 (0x7f78986a94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #17: + 0x5aee004 (0x7f789c6c8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #18: + 0x5af36b5 (0x7f789c6cd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #19: + 0xd2631e (0x7f78af2b731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #20: + 0x47def4 (0x7f78aea0eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #21: + 0x1445a6 (0x5624860665a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56248605fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #23: + 0x150866 (0x562486072866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56248605b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #25: _PyFunction_Vectorcall + 0x6c (0x562486066a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #26: PyObject_Call + 0xbc (0x562486072f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5624860592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #28: _PyFunction_Vectorcall + 0x6c (0x562486066a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #30: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #32: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #34: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56248605ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #37: _PyObject_Call_Prepend + 0x69 (0x562486070c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #38: + 0x211239 (0x562486133239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56248605fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56248605b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #41: _PyFunction_Vectorcall + 0x6c (0x562486066a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562486056c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #43: _PyFunction_Vectorcall + 0x6c (0x562486066a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #45: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #46: PyObject_Call + 0xbc (0x562486072f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5624860592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #48: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #49: PyObject_Call + 0xbc (0x562486072f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5624860592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #51: _PyFunction_Vectorcall + 0x6c (0x562486066a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56248605f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #53: _PyObject_Call_Prepend + 0x69 (0x562486070c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #54: + 0x211239 (0x562486133239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #55: _PyObject_MakeTpCall + 0x26b (0x56248605fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x56248605bc53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #57: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5624860578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #59: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #60: PyObject_Call + 0xbc (0x562486072f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5624860592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #62: + 0x150582 (0x562486072582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #63: PyObject_Call + 0xbc (0x562486072f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f06cc1ba897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: frame #1: + 0x5b3a23e (0x7f0705cd723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0705cd1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0705cd1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0705cd2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0705c87371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0705c87371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0705c87371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0705c87371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f06cd494189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f06cd49b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f06cd4ba978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #12: + 0x5adc309 (0x7f0705c79309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #13: + 0x5ae6f10 (0x7f0705c83f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #14: + 0x5ae6fa5 (0x7f0705c83fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #15: + 0x5124446 (0x7f07052c1446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #16: + 0x1acf4b8 (0x7f0701c6c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #17: + 0x5aee004 (0x7f0705c8b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #18: + 0x5af36b5 (0x7f0705c906b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #19: + 0xd2631e (0x7f071887a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #20: + 0x47def4 (0x7f0717fd1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #21: + 0x1445a6 (0x5595877df5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5595877d8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #23: + 0x150866 (0x5595877eb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5595877d4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5595877dfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x5595877ebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5595877d22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5595877dfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #30: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #32: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #34: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5595877d7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5595877e9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #38: + 0x211239 (0x5595878ac239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5595877d8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5595877d43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5595877dfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5595877cfc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5595877dfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #45: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x5595877ebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5595877d22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #48: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x5595877ebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5595877d22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5595877dfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5595877d8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5595877e9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #54: + 0x211239 (0x5595878ac239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x5595877ec067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5595877d22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #57: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5595877d08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #59: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x5595877ebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5595877d22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #62: + 0x150582 (0x5595877eb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x5595877ebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank2]: pipeline_state.run_communication() [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank2]: recv_activation_tensor = recv_activation() [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank2]: dist.recv( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank2]: return func(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank2]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank2]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank2]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b7abca897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank2]: frame #1: + 0x5b3a23e (0x7f1bb46e723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1bb46e1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1bb46e1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1bb46e2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1bb4697371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1bb4697371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1bb4697371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1bb4697371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1b7bea4189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1b7beab610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1b7beca978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #12: + 0x5adc309 (0x7f1bb4689309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #13: + 0x5ae6f10 (0x7f1bb4693f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #14: + 0x5ae6fa5 (0x7f1bb4693fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #15: + 0x5124446 (0x7f1bb3cd1446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #16: + 0x1acf4b8 (0x7f1bb067c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #17: + 0x5aee004 (0x7f1bb469b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #18: + 0x5af36b5 (0x7f1bb46a06b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #19: + 0xd2631e (0x7f1bc728a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank2]: frame #20: + 0x47def4 (0x7f1bc69e1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank2]: frame #21: + 0x1445a6 (0x556ac02fc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #22: _PyObject_MakeTpCall + 0x26b (0x556ac02f5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #23: + 0x150866 (0x556ac0308866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556ac02f1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #25: _PyFunction_Vectorcall + 0x6c (0x556ac02fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #26: PyObject_Call + 0xbc (0x556ac0308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556ac02ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #28: _PyFunction_Vectorcall + 0x6c (0x556ac02fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #30: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #32: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #34: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556ac02f4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #37: _PyObject_Call_Prepend + 0x69 (0x556ac0306c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #38: + 0x211239 (0x556ac03c9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #39: _PyObject_MakeTpCall + 0x26b (0x556ac02f5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556ac02f13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #41: _PyFunction_Vectorcall + 0x6c (0x556ac02fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556ac02ecc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #43: _PyFunction_Vectorcall + 0x6c (0x556ac02fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #45: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #46: PyObject_Call + 0xbc (0x556ac0308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556ac02ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #48: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #49: PyObject_Call + 0xbc (0x556ac0308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556ac02ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #51: _PyFunction_Vectorcall + 0x6c (0x556ac02fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556ac02f5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #53: _PyObject_Call_Prepend + 0x69 (0x556ac0306c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #54: + 0x211239 (0x556ac03c9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #55: PyObject_Call + 0x207 (0x556ac0309067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556ac02ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #57: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556ac02ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #59: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #60: PyObject_Call + 0xbc (0x556ac0308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556ac02ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #62: + 0x150582 (0x556ac0308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #63: PyObject_Call + 0xbc (0x556ac0308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank4]: dist.recv( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank4]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb1c9ad3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank4]: frame #1: + 0x5b3a23e (0x7fb2035f023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb2035eac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb2035eaf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb2035ebfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2035a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2035a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2035a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2035a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb1cadad189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb1cadb4610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb1cadd3978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #12: + 0x5adc309 (0x7fb203592309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #13: + 0x5ae6f10 (0x7fb20359cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #14: + 0x5ae6fa5 (0x7fb20359cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #15: + 0x5124446 (0x7fb202bda446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #16: + 0x1acf4b8 (0x7fb1ff5854b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #17: + 0x5aee004 (0x7fb2035a4004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #18: + 0x5af36b5 (0x7fb2035a96b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #19: + 0xd2631e (0x7fb21619331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #20: + 0x47def4 (0x7fb2158eaef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #21: + 0x1445a6 (0x561fc4f955a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561fc4f8ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #23: + 0x150866 (0x561fc4fa1866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561fc4f8a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561fc4f95a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x561fc4fa1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561fc4f882b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561fc4f95a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #30: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #32: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #34: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561fc4f8df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561fc4f9fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #38: + 0x211239 (0x561fc5062239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561fc4f8ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561fc4f8a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561fc4f95a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561fc4f85c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561fc4f95a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #45: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x561fc4fa1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561fc4f882b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #48: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x561fc4fa1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561fc4f882b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561fc4f95a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561fc4f8e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561fc4f9fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #54: + 0x211239 (0x561fc5062239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x561fc4fa2067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561fc4f882b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #57: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561fc4f868fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #59: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x561fc4fa1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561fc4f882b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #62: + 0x150582 (0x561fc4fa1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x561fc4fa1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: return forward_call(*args, **kwargs) [default3]:[rank3]: recv_activation_tensor = recv_activation() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: recv_activation_tensor = recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank6]: dist.recv( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: return func(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank3]: dist.recv( [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5abb0c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank3]: return func(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank3]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank3]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8066d82897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank3]: frame #1: + 0x5b3a23e (0x7f80a089f23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #1: + 0x5b3a23e (0x7fb5e562923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb5e5623c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f80a0899c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb5e5623f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb5e5624fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e55d9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e55d9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e55d9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f80a0899f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f80a089afd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80a084f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e55d9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80a084f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb5acde6189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80a084f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb5acded610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb5ace0c978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80a084f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #12: + 0x5adc309 (0x7fb5e55cb309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #13: + 0x5ae6f10 (0x7fb5e55d5f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f806805c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8068063610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8068082978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #14: + 0x5ae6fa5 (0x7fb5e55d5fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #12: + 0x5adc309 (0x7f80a0841309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #13: + 0x5ae6f10 (0x7f80a084bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #15: + 0x5124446 (0x7fb5e4c13446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #16: + 0x1acf4b8 (0x7fb5e15be4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #14: + 0x5ae6fa5 (0x7f80a084bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #15: + 0x5124446 (0x7f809fe89446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #17: + 0x5aee004 (0x7fb5e55dd004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #18: + 0x5af36b5 (0x7fb5e55e26b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #16: + 0x1acf4b8 (0x7f809c8344b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #17: + 0x5aee004 (0x7f80a0853004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #18: + 0x5af36b5 (0x7f80a08586b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #19: + 0xd2631e (0x7f80b344231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank3]: frame #20: + 0x47def4 (0x7f80b2b99ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank3]: frame #21: + 0x1445a6 (0x55e8e83495a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #19: + 0xd2631e (0x7fb5f81cc31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #20: + 0x47def4 (0x7fb5f7923ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank3]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55e8e8342a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #23: + 0x150866 (0x55e8e8355866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #21: + 0x1445a6 (0x5587bce025a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5587bcdfba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e8e833e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55e8e8349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #26: PyObject_Call + 0xbc (0x55e8e8355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e8e833c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55e8e8349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #23: + 0x150866 (0x5587bce0e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5587bcdf7142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5587bce02a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #30: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x5587bce0ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #32: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5587bcdf52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #34: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e8e8341f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55e8e8353c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5587bce02a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #38: + 0x211239 (0x55e8e8416239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55e8e8342a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e8e833e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55e8e8349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e8e8339c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55e8e8349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #30: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #45: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #46: PyObject_Call + 0xbc (0x55e8e8355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #32: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #34: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5587bcdfaf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e8e833c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5587bce0cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #38: + 0x211239 (0x5587bcecf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #48: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: frame #49: PyObject_Call + 0xbc (0x55e8e8355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e8e833c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55e8e8349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e8e8342007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55e8e8353c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #54: + 0x211239 (0x55e8e8416239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: frame #55: PyObject_Call + 0x207 (0x55e8e8356067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5587bcdfba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e8e833c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #57: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e8e833a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5587bcdf73e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #59: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5587bce02a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5587bcdf2c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #60: PyObject_Call + 0xbc (0x55e8e8355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e8e833c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #62: + 0x150582 (0x55e8e8355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5587bce02a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #45: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x5587bce0ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #63: PyObject_Call + 0xbc (0x55e8e8355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5587bcdf52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #48: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x5587bce0ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5587bcdf52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5587bce02a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5587bcdfb007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5587bce0cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #54: + 0x211239 (0x5587bcecf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x5587bce0f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5587bcdf52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #57: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5587bcdf38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #59: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x5587bce0ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5587bcdf52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #62: + 0x150582 (0x5587bce0e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x5587bce0ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: recv_activation_tensor = recv_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank5]: dist.recv( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f886d10e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank5]: frame #1: + 0x5b3a23e (0x7f88a6c2b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f88a6c25c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f88a6c25f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f88a6c26fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f88a6bdb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f88a6bdb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f88a6bdb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f88a6bdb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f886e3e8189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f886e3ef610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f886e40e978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #12: + 0x5adc309 (0x7f88a6bcd309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #13: + 0x5ae6f10 (0x7f88a6bd7f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #14: + 0x5ae6fa5 (0x7f88a6bd7fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #15: + 0x5124446 (0x7f88a6215446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #16: + 0x1acf4b8 (0x7f88a2bc04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #17: + 0x5aee004 (0x7f88a6bdf004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #18: + 0x5af36b5 (0x7f88a6be46b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #19: + 0xd2631e (0x7f88b97ce31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #20: + 0x47def4 (0x7f88b8f25ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #21: + 0x1445a6 (0x5641158db5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5641158d4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #23: + 0x150866 (0x5641158e7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5641158d0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5641158dba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x5641158e7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5641158ce2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5641158dba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #30: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #32: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #34: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5641158d3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5641158e5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #38: + 0x211239 (0x5641159a8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5641158d4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5641158d03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5641158dba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5641158cbc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5641158dba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #45: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x5641158e7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5641158ce2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #48: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x5641158e7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5641158ce2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5641158dba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5641158d4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5641158e5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #54: + 0x211239 (0x5641159a8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x5641158e8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5641158ce2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #57: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5641158cc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #59: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x5641158e7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5641158ce2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #62: + 0x150582 (0x5641158e7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x5641158e7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. W0703 16:21:39.873000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262598 closing signal SIGTERM W0703 16:21:39.873000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262599 closing signal SIGTERM W0703 16:21:39.874000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262600 closing signal SIGTERM W0703 16:21:39.874000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262601 closing signal SIGTERM W0703 16:21:39.874000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262602 closing signal SIGTERM W0703 16:21:39.875000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 262603 closing signal SIGTERM [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank13]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank13]: pipeline_state.run_communication() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank13]: recv_activation_tensor = recv_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank13]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank13]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank13]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank13]: dist.recv( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank13]: return func(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank13]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank13]: torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default5]:[rank13]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank13]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f930af1e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank13]: frame #1: + 0x5b3a23e (0x7f9344a3b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f9344a35c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9344a35f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9344a36fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93449eb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93449eb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93449eb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93449eb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f930c1f8189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f930c1ff610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f930c21e978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #12: + 0x5adc309 (0x7f93449dd309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #13: + 0x5ae6f10 (0x7f93449e7f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #14: + 0x5ae6fa5 (0x7f93449e7fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #15: + 0x5124446 (0x7f9344025446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #16: + 0x1acf4b8 (0x7f93409d04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #17: + 0x5aee004 (0x7f93449ef004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #18: + 0x5af36b5 (0x7f93449f46b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #19: + 0xd2631e (0x7f93575de31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #20: + 0x47def4 (0x7f9356d35ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #21: + 0x1445a6 (0x558086d0d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #22: _PyObject_MakeTpCall + 0x26b (0x558086d06a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #23: + 0x150866 (0x558086d19866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x558086d02142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #25: _PyFunction_Vectorcall + 0x6c (0x558086d0da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #26: PyObject_Call + 0xbc (0x558086d19f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x558086d002b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #28: _PyFunction_Vectorcall + 0x6c (0x558086d0da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #30: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #32: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #34: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558086d05f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #37: _PyObject_Call_Prepend + 0x69 (0x558086d17c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #38: + 0x211239 (0x558086dda239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #39: _PyObject_MakeTpCall + 0x26b (0x558086d06a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x558086d023e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #41: _PyFunction_Vectorcall + 0x6c (0x558086d0da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x558086cfdc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #43: _PyFunction_Vectorcall + 0x6c (0x558086d0da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #45: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #46: PyObject_Call + 0xbc (0x558086d19f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x558086d002b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #48: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #49: PyObject_Call + 0xbc (0x558086d19f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x558086d002b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #51: _PyFunction_Vectorcall + 0x6c (0x558086d0da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558086d06007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #53: _PyObject_Call_Prepend + 0x69 (0x558086d17c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #54: + 0x211239 (0x558086dda239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #55: PyObject_Call + 0x207 (0x558086d1a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x558086d002b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #57: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x558086cfe8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #59: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #60: PyObject_Call + 0xbc (0x558086d19f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x558086d002b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #62: + 0x150582 (0x558086d19582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #63: PyObject_Call + 0xbc (0x558086d19f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default7]:[rank15]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank15]: recv_activation_tensor = recv_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank15]: dist.recv( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank15]: return func(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank15]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default7]:[rank15]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank15]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f14d04fc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank15]: frame #1: + 0x5b3a23e (0x7f150a01923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f150a013c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f150a013f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f150a014fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1509fc9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1509fc9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1509fc9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1509fc9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f14d17d6189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f14d17dd610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f14d17fc978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #12: + 0x5adc309 (0x7f1509fbb309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #13: + 0x5ae6f10 (0x7f1509fc5f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #14: + 0x5ae6fa5 (0x7f1509fc5fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #15: + 0x5124446 (0x7f1509603446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #16: + 0x1acf4b8 (0x7f1505fae4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #17: + 0x5aee004 (0x7f1509fcd004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #18: + 0x5af36b5 (0x7f1509fd26b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #19: + 0xd2631e (0x7f151cbbc31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #20: + 0x47def4 (0x7f151c313ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #21: + 0x1445a6 (0x55e18ac7e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55e18ac77a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #23: + 0x150866 (0x55e18ac8a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e18ac73142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55e18ac7ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #26: PyObject_Call + 0xbc (0x55e18ac8af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e18ac712b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55e18ac7ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #30: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #32: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #34: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e18ac76f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55e18ac88c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #38: + 0x211239 (0x55e18ad4b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55e18ac77a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e18ac733e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55e18ac7ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e18ac6ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55e18ac7ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #45: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #46: PyObject_Call + 0xbc (0x55e18ac8af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e18ac712b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #48: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #49: PyObject_Call + 0xbc (0x55e18ac8af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e18ac712b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55e18ac7ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e18ac77007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55e18ac88c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #54: + 0x211239 (0x55e18ad4b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #55: _PyObject_MakeTpCall + 0x26b (0x55e18ac77a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x55e18ac73c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #57: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e18ac6f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #59: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #60: PyObject_Call + 0xbc (0x55e18ac8af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e18ac712b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #62: + 0x150582 (0x55e18ac8a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #63: PyObject_Call + 0xbc (0x55e18ac8af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: recv_activation_tensor = recv_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank9]: dist.recv( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default1]:[rank9]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank9]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5dae36b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank9]: frame #1: + 0x5b3a23e (0x7f5de7e8823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5de7e82c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5de7e82f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5de7e83fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5de7e38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5de7e38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5de7e38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5de7e38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5daf645189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5daf64c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5daf66b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #12: + 0x5adc309 (0x7f5de7e2a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #13: + 0x5ae6f10 (0x7f5de7e34f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #14: + 0x5ae6fa5 (0x7f5de7e34fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #15: + 0x5124446 (0x7f5de7472446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #16: + 0x1acf4b8 (0x7f5de3e1d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #17: + 0x5aee004 (0x7f5de7e3c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #18: + 0x5af36b5 (0x7f5de7e416b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #19: + 0xd2631e (0x7f5dfaa2b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank9]: frame #20: + 0x47def4 (0x7f5dfa182ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank9]: frame #21: + 0x1445a6 (0x55b94d86e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b94d867a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #23: + 0x150866 (0x55b94d87a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b94d863142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b94d86ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #26: PyObject_Call + 0xbc (0x55b94d87af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b94d8612b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b94d86ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #30: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #32: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #34: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b94d866f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b94d878c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #38: + 0x211239 (0x55b94d93b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b94d867a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b94d8633e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b94d86ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b94d85ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b94d86ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #45: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #46: PyObject_Call + 0xbc (0x55b94d87af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b94d8612b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #48: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #49: PyObject_Call + 0xbc (0x55b94d87af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b94d8612b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b94d86ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b94d867007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b94d878c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #54: + 0x211239 (0x55b94d93b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #55: PyObject_Call + 0x207 (0x55b94d87b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b94d8612b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #57: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b94d85f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #59: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #60: PyObject_Call + 0xbc (0x55b94d87af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b94d8612b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #62: + 0x150582 (0x55b94d87a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #63: PyObject_Call + 0xbc (0x55b94d87af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank10]: pipeline_state.run_communication() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank10]: recv_activation_tensor = recv_activation() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank10]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank10]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank10]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank10]: dist.recv( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank10]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank10]: torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default2]:[rank10]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank10]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55be675897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank10]: frame #1: + 0x5b3a23e (0x7f55f819223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f55f818cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f55f818cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f55f818dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f55f8142371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f55f8142371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f55f8142371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f55f8142371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f55bf94f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f55bf956610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f55bf975978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #12: + 0x5adc309 (0x7f55f8134309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #13: + 0x5ae6f10 (0x7f55f813ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #14: + 0x5ae6fa5 (0x7f55f813efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #15: + 0x5124446 (0x7f55f777c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #16: + 0x1acf4b8 (0x7f55f41274b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #17: + 0x5aee004 (0x7f55f8146004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #18: + 0x5af36b5 (0x7f55f814b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #19: + 0xd2631e (0x7f560ad3531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #20: + 0x47def4 (0x7f560a48cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #21: + 0x1445a6 (0x55b2a57b25a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b2a57aba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #23: + 0x150866 (0x55b2a57be866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b2a57a7142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b2a57b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #26: PyObject_Call + 0xbc (0x55b2a57bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2a57a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b2a57b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #30: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #32: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #34: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b2a57aaf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b2a57bcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #38: + 0x211239 (0x55b2a587f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b2a57aba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b2a57a73e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b2a57b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b2a57a2c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b2a57b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #45: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #46: PyObject_Call + 0xbc (0x55b2a57bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2a57a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #48: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #49: PyObject_Call + 0xbc (0x55b2a57bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2a57a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b2a57b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b2a57ab007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b2a57bcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #54: + 0x211239 (0x55b2a587f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #55: PyObject_Call + 0x207 (0x55b2a57bf067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2a57a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #57: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b2a57a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #59: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #60: PyObject_Call + 0xbc (0x55b2a57bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2a57a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #62: + 0x150582 (0x55b2a57be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #63: PyObject_Call + 0xbc (0x55b2a57bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: . This may indicate a possible application crash on rank 0 or a network set up issue. E0703 16:21:41.095000 140247216203584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 262596) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_16:21:39 host : ip-26-0-160-225.ec2.internal rank : 1 (local_rank: 1) exitcode : 1 (pid: 262597) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_16:21:39 host : ip-26-0-160-225.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 262596) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pipeline_state.run_communication() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank11]: recv_activation_tensor = recv_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank11]: dist.recv( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank11]: return func(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank11]: torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default3]:[rank11]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank11]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9520baa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank11]: frame #1: + 0x5b3a23e (0x7f955a6c723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f955a6c1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f955a6c1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f955a6c2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f955a677371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f955a677371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f955a677371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f955a677371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9521e84189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f9521e8b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f9521eaa978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #12: + 0x5adc309 (0x7f955a669309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #13: + 0x5ae6f10 (0x7f955a673f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #14: + 0x5ae6fa5 (0x7f955a673fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #15: + 0x5124446 (0x7f9559cb1446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #16: + 0x1acf4b8 (0x7f955665c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #17: + 0x5aee004 (0x7f955a67b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #18: + 0x5af36b5 (0x7f955a6806b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #19: + 0xd2631e (0x7f956d26a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #20: + 0x47def4 (0x7f956c9c1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #21: + 0x1445a6 (0x55e002e175a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55e002e10a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #23: + 0x150866 (0x55e002e23866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e002e0c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55e002e17a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #26: PyObject_Call + 0xbc (0x55e002e23f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e002e0a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55e002e17a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #30: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #32: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #34: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e002e0ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55e002e21c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #38: + 0x211239 (0x55e002ee4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55e002e10a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e002e0c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55e002e17a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e002e07c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55e002e17a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #45: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #46: PyObject_Call + 0xbc (0x55e002e23f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e002e0a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #48: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #49: PyObject_Call + 0xbc (0x55e002e23f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e002e0a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55e002e17a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e002e10007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55e002e21c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #54: + 0x211239 (0x55e002ee4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #55: PyObject_Call + 0x207 (0x55e002e24067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e002e0a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #57: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e002e088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #59: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #60: PyObject_Call + 0xbc (0x55e002e23f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e002e0a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #62: + 0x150582 (0x55e002e23582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #63: PyObject_Call + 0xbc (0x55e002e23f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: . This may indicate a possible application crash on rank 0 or a network set up issue. srun: error: ip-26-0-160-225: task 0: Exited with exit code 1 W0703 16:21:43.976000 140456766490368 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-134.ec2.internal_2948039_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 16:21:44.876000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2948109 closing signal SIGTERM W0703 16:21:44.877000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2948110 closing signal SIGTERM W0703 16:21:44.877000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2948111 closing signal SIGTERM W0703 16:21:44.877000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2948112 closing signal SIGTERM W0703 16:21:44.878000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2948115 closing signal SIGTERM E0703 16:21:45.795000 140462433310528 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2948108) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 16:21:45.800000 140462433310528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-134.ec2.internal_2948039_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 16:21:45.829000 140462433310528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-134.ec2.internal_2948039_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 16:21:45.844000 140462433310528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-134.ec2.internal_2948039_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_16:21:44 host : ip-26-0-163-134.ec2.internal rank : 13 (local_rank: 5) exitcode : 1 (pid: 2948113) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_16:21:44 host : ip-26-0-163-134.ec2.internal rank : 14 (local_rank: 6) exitcode : 1 (pid: 2948114) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_16:21:44 host : ip-26-0-163-134.ec2.internal rank : 8 (local_rank: 0) exitcode : 1 (pid: 2948108) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-163-134: task 1: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.