diff --git "a/experiments/2022-12-19-fdf21cd1874b02afe17fee417ba59c79dadadd87f9b5944402c89d476acb4861/output.log" "b/experiments/2022-12-19-fdf21cd1874b02afe17fee417ba59c79dadadd87f9b5944402c89d476acb4861/output.log" new file mode 100644--- /dev/null +++ "b/experiments/2022-12-19-fdf21cd1874b02afe17fee417ba59c79dadadd87f9b5944402c89d476acb4861/output.log" @@ -0,0 +1,9862 @@ +nohup: ignoring input +[2022-12-18 10:53:56,268] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. +[2022-12-18 10:53:56,292] [INFO] [runner.py:508:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 tune_gpt.py --deepspeed deepspeed.json --upload-model +[2022-12-18 10:53:57,962] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]} +[2022-12-18 10:53:57,962] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0 +[2022-12-18 10:53:57,962] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1]}) +[2022-12-18 10:53:57,962] [INFO] [launch.py:162:main] dist_world_size=2 +[2022-12-18 10:53:57,962] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1 +Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. +Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. +Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. +Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. +No config specified, defaulting to: apps/all +No config specified, defaulting to: apps/all +Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5) +Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5) +Max length: 2048 +Max length: 2048 +PyTorch: setting up devices +PyTorch: setting up devices +[2022-12-18 10:54:11,976] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl +The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-). +The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-). +GPU memory occupied: 3404 MB. +GPU memory occupied: 3404 MB. +Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... +Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... +Detected CUDA files, patching ldflags +Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja... +Building extension module cpu_adam... +Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) +ninja: no work to do. +Loading extension module cpu_adam... 
+Time to load cpu_adam op: 2.6207056045532227 seconds
+Loading extension module cpu_adam...
+Time to load cpu_adam op: 2.6935393810272217 seconds
+Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
+Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
+Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/utils/build.ninja...
+Building extension module utils...
+Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
+ninja: no work to do.
+Loading extension module utils...
+Time to load utils op: 0.31526732444763184 seconds
+Loading extension module utils...
+Time to load utils op: 0.3027935028076172 seconds
+Rank: 0 partition count [2] and sizes[(62600064, False)]
+Rank: 1 partition count [2] and sizes[(62600064, False)]
+
+Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
+No modifications detected for re-loaded extension module utils, skipping build step...
+Loading extension module utils...
+Time to load utils op: 0.0005586147308349609 seconds
+Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
+No modifications detected for re-loaded extension module utils, skipping build step...
+Loading extension module utils...
+Time to load utils op: 0.00031638145446777344 seconds
+  0%|          | 0/48845 [00:00<?, ?it/s]