slurm submission log: 2024-05-11 17:54:08.580940
created following sbatch script:
###############################
#!/bin/bash
#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --dependency=afterok:
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-4732981
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0
# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
# cd to working directory
cd .
# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29504 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 14 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'
###############################
submission to slurm complete!
###############################
slurm submission output
sbatch: error: Batch job submission failed: Job dependency problem
###############################
slurm submission log: 2024-05-11 17:55:06.870798
created following sbatch script:
###############################
#!/bin/bash
#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --dependency=afterok:7598872
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-4842885
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0
# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
# cd to working directory
cd .
# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29504 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 14 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'
###############################
submission to slurm complete!
###############################
slurm submission output
Submitted batch job 7598873
###############################
###############################
start time: 2024-05-11 17:57:32.246511
machine: sphinx2
conda env: pretraining-coreset-selection
###############################
running following processes
torchrun --master_port 29504 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 14 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2
###############################
command outputs:
[2024-05-11 17:57:34,935] torch.distributed.run: [WARNING]
[2024-05-11 17:57:34,935] torch.distributed.run: [WARNING] *****************************************
[2024-05-11 17:57:34,935] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-11 17:57:34,935] torch.distributed.run: [WARNING] *****************************************
05/11/2024 17:57:45 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq', output_hub_id='pythia-70m_sciq', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=14, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
05/11/2024 17:57:45 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq', output_hub_id='pythia-70m_sciq', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=14, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
Traceback (most recent call last):
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 202, in <module>
    train_model()
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 162, in train_model
    train_dataset = load_from_disk(script_args.dataset_id)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/datasets/load.py", line 2638, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq is neither a `Dataset` directory nor a `DatasetDict` directory.
Traceback (most recent call last):
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 202, in <module>
    train_model()
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 162, in train_model
    train_dataset = load_from_disk(script_args.dataset_id)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/datasets/load.py", line 2638, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq is neither a `Dataset` directory nor a `DatasetDict` directory.
[2024-05-11 17:57:49,960] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3600439) of binary: /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/bin/python
Traceback (most recent call last):
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_llm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-05-11_17:57:49
  host : sphinx2.stanford.edu
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 3600440)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-05-11_17:57:49
  host : sphinx2.stanford.edu
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 3600439)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
###############################
end time: 2024-05-11 17:57:52.272138
elapsed time: 0:00:20.025627
slurm submission log: 2024-05-11 18:01:39.581982
created following sbatch script:
###############################
#!/bin/bash
#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --dependency=afterok:7598911
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-4293909
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0
# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
# cd to working directory
cd .
# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29504 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 14 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'
###############################
submission to slurm complete!
###############################
slurm submission output
Submitted batch job 7598912
###############################
###############################
start time: 2024-05-11 20:42:46.833506
machine: sphinx2
conda env: pretraining-coreset-selection
###############################
running following processes
torchrun --master_port 29504 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 14 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2
###############################
command outputs:
[2024-05-11 20:42:48,810] torch.distributed.run: [WARNING]
[2024-05-11 20:42:48,810] torch.distributed.run: [WARNING] *****************************************
[2024-05-11 20:42:48,810] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-11 20:42:48,810] torch.distributed.run: [WARNING] *****************************************
05/11/2024 20:42:55 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq', output_hub_id='pythia-70m_sciq', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=14, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
05/11/2024 20:42:55 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_5/sciq', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_5/pythia-70m_sciq', output_hub_id='pythia-70m_sciq', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=14, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
0%| | 0/10682 [00:00