env setup guide: pytorch+deepspeed+transformers

#4
by fkov - opened

Hi @sanchit-gandhi, could you please share the specs for setting up your environment for PyTorch + DeepSpeed + Transformers (and for a cluster, if you used one)?

I keep getting the error:
NotADirectoryError: [Errno 20] Not a directory: 'hipconfig'

Hey @fkov - this was done on a vanilla Google GCP T4, see framework info here: https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id#framework-versions

To set up, I just pip installed PyTorch (following the local installation instructions) and DeepSpeed, and installed transformers from main. Could you share the full stack trace for your error, please?
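When comparing two environments like this, it can help to print the installed versions of the three packages side by side with the framework versions on the model card. The following is only a minimal sketch: the function name `installed_versions` is hypothetical, and it assumes the packages were installed under the distribution names `torch`, `deepspeed`, and `transformers`.

```python
from importlib import metadata


def installed_versions(packages=("torch", "deepspeed", "transformers")):
    """Return a {package: version-or-None} mapping without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata instead of importing the packages means the check still runs when importing one of them fails.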

I tried with

GPU: NVIDIA A100-SXM4-40GB

1. PyTorch, following https://pytorch.org/get-started/locally/ (Stable (2.0.1), Linux, Conda, Python, CUDA 11.8):
   conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

2. transformers version 4.31.0.dev0, installed from source:
   git clone https://github.com/huggingface/transformers
   cd transformers
   pip install .
   pip install -r /home/fkov/transformers/examples/pytorch/audio-classification/requirements.txt

3. DeepSpeed, built from source:
   git clone https://github.com/microsoft/DeepSpeed/
   cd DeepSpeed
   rm -rf build
   TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1; python setup.py build_ext -j8 bdist_wheel
   pip install .

The error is:

Using /home/fkov/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/fkov/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

…

[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/TH -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/THC -isystem /home/fkov/.conda/envs/ws/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o

FAILED: custom_cuda_kernel.cuda.o

/usr/bin/nvcc

/usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/TH -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/THC -isystem /home/fkov/.conda/envs/ws/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o

ERROR: No supported gcc/g++ host compiler found.
Use 'nvcc -ccbin ' to specify a host compiler.

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/fkov/.conda/envs/ws/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/fkov/transformers/examples/pytorch/audio-classification/run_audio_classification_w.py", line 51, in <module>
deepspeed.ops.op_builder.CPUAdamBuilder().load()
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e

RuntimeError: Error building extension 'cpu_adam'

[2023-07-07 16:33:02,027] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2819509
[2023-07-07 16:33:02,028] [ERROR] [launch.py:321:sigkill_handler] ['/home/fkov/.conda/envs/ws/bin/python', '-u', 'run_audio_classification_w.py', '--local_rank=0', '--deepspeed', '/home/fkov/transformers/examples/pytorch/audio-classification/ds_config.json', '--model_name_or_path', 'openai/whisper-medium', '--output_dir', 'l/users/fkov/outputs/w_sn', '--overwrite_output_dir', '--remove_unused_columns', 'False', '--do_train', '--do_eval', '--fp16', '--learning_rate', '3e-5', '--max_length_seconds', '30', '--attention_mask', 'False', '--warmup_ratio', '0.1', '--num_train_epochs', '3', '--per_device_train_batch_size', '16', '--gradient_accumulation_steps', '2', '--gradient_checkpointing', 'True', '--per_device_eval_batch_size', '32', '--dataloader_num_workers', '8', '--logging_strategy', 'steps', '--logging_steps', '25', '--evaluation_strategy', 'epoch', '--save_strategy', 'epoch', '--load_best_model_at_end', 'True', '--metric_for_best_model', 'accuracy', '--seed', '0', '--freeze_feature_encoder', 'False', '--push_to_hub'] exits with return code = 1
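The key line in the log above is nvcc's "No supported gcc/g++ host compiler found": the JIT build of cpu_adam needs both nvcc and a host C++ compiler on the node where the script actually runs. As a rough first check, a sketch like the one below lists which of the relevant tools are on the PATH. The function name `find_build_tools` and the choice of tools are assumptions made for illustration; it only shows whether the executables exist, not whether nvcc accepts that particular gcc version.

```python
import shutil


def find_build_tools(tools=("nvcc", "gcc", "g++", "ninja")):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

Running this inside the same job that launches training, rather than on the login node, shows what the build will actually see.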

I managed to do this by just installing DeepSpeed from the most recent PyPI release. Could you try doing it this way rather than building from source?

I suspect DeepSpeed is not working because I am running it on an on-premise SLURM cluster. @sanchit-gandhi, is there a guide on installing DeepSpeed in a Conda environment on the cluster's login node and then calling DeepSpeed from a bash script on the GPU nodes?

So far I have installed DeepSpeed on the cluster's login node (which has no GPU) and run the bash script from that environment on the login node. In the bash script I specify the GPU resources and the classification.py script to run for fine-tuning.
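Since the install happens on a login node without a GPU while training runs on a GPU node, it may be worth confirming from inside the job which kind of node the Python process is on. Below is a small sketch under a few assumptions: the helper name `describe_node` is hypothetical, it relies on SLURM setting `SLURM_JOB_ID` inside a job, and whether `CUDA_VISIBLE_DEVICES` is populated depends on how the cluster is configured.

```python
import os
import shutil


def describe_node(env=None):
    """Collect hints about whether this process runs inside a SLURM GPU job."""
    env = os.environ if env is None else env
    return {
        # SLURM sets SLURM_JOB_ID for processes started inside a job.
        "in_slurm_job": "SLURM_JOB_ID" in env,
        # May be unset even on a GPU node, depending on the cluster setup.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        # Path to nvidia-smi if the NVIDIA driver tools are on PATH, else None.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    for key, value in describe_node().items():
        print(f"{key}: {value}")
```

Calling it at the top of the training script makes it visible in the job log whether the code ran on the login node by mistake.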

@sanchit-gandhi, I found the solution: I had to run DS_BUILD_OPS=0 pip install deepspeed on the login node.
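As far as I understand it, DS_BUILD_OPS=0 tells the DeepSpeed installer not to pre-compile its C++/CUDA ops, so they are JIT-compiled the first time they are used, which here would be on the GPU node. A hedged way to confirm this from inside a job is to trigger the same load that appears in the traceback above and report the outcome. The wrapper name `try_load_cpu_adam` is hypothetical; `CPUAdamBuilder().load()` is the call from the original script.

```python
def try_load_cpu_adam():
    """Try to JIT-build and load DeepSpeed's cpu_adam op; return a status message."""
    try:
        from deepspeed.ops.op_builder import CPUAdamBuilder
    except Exception as exc:  # import can fail for reasons other than ImportError
        return f"could not import deepspeed: {exc!r}"
    try:
        # Same call as in the traceback; compiles the op on first use.
        CPUAdamBuilder().load()
    except Exception as exc:
        return f"cpu_adam failed to build: {exc!r}"
    return "cpu_adam loaded"


if __name__ == "__main__":
    print(try_load_cpu_adam())
```

The first call can take a few minutes while the extension compiles; later runs should reuse the cached build under the torch_extensions directory shown in the log.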
