env setup guide: pytorch+deepspeed+transformers

by fkov - opened Jul 3, 2023

fkov

Jul 3, 2023

Hi @sanchit-gandhi, could you please share the specs for setting up your environment for pytorch+deepspeed+transformers (and for a cluster, if you used one)?

I keep getting the error:
NotADirectoryError: [Errno 20] Not a directory: 'hipconfig'

sanchit-gandhi

Owner Jul 3, 2023

Hey @fkov - this was done on a vanilla Google GCP T4, see framework info here: https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id#framework-versions

To set up, I just pip installed PyTorch (following the local installation instructions) and DeepSpeed, and installed transformers from main. Could you share the full stack trace for your error, please?
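When comparing an environment against the framework versions linked above, it can help to print what is actually installed. The following is a minimal sketch, assuming the three distributions are named torch, deepspeed and transformers on PyPI; the helper name installed_versions is hypothetical and not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust if your environment uses others.
    for name, version in installed_versions(["torch", "deepspeed", "transformers"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata avoids importing the packages themselves, so the check also runs on a node without a GPU.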

fkov

Jul 7, 2023

edited Jul 7, 2023

I tried with

GPU: NVIDIA A100-SXM4-40GB

1. PyTorch, per https://pytorch.org/get-started/locally/ (Stable (2.0.1), Linux, Conda, Python, CUDA 11.8):
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

2. transformers version 4.31.0.dev0, from source:
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r /home/fkov/transformers/examples/pytorch/audio-classification/requirements.txt

3. DeepSpeed, from source:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1; python setup.py build_ext -j8 bdist_wheel
pip install .

The error is:

Using /home/fkov/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/fkov/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

…

[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/TH -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/THC -isystem /home/fkov/.conda/envs/ws/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o

FAILED: custom_cuda_kernel.cuda.o

/usr/bin/nvcc

/usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/TH -isystem /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/include/THC -isystem /home/fkov/.conda/envs/ws/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o

ERROR: No supported gcc/g++ host compiler found.
Use 'nvcc -ccbin ' to specify a host compiler.

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/fkov/.conda/envs/ws/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/fkov/transformers/examples/pytorch/audio-classification/run_audio_classification_w.py", line 51, in
deepspeed.ops.op_builder.CPUAdamBuilder().load()
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/fkov/.conda/envs/ws/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e

RuntimeError: Error building extension 'cpu_adam'

[2023-07-07 16:33:02,027] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2819509
[2023-07-07 16:33:02,028] [ERROR] [launch.py:321:sigkill_handler] ['/home/fkov/.conda/envs/ws/bin/python', '-u', 'run_audio_classification_w.py', '--local_rank=0', '--deepspeed', '/home/fkov/transformers/examples/pytorch/audio-classification/ds_config.json', '--model_name_or_path', 'openai/whisper-medium', '--output_dir', 'l/users/fkov/outputs/w_sn', '--overwrite_output_dir', '--remove_unused_columns', 'False', '--do_train', '--do_eval', '--fp16', '--learning_rate', '3e-5', '--max_length_seconds', '30', '--attention_mask', 'False', '--warmup_ratio', '0.1', '--num_train_epochs', '3', '--per_device_train_batch_size', '16', '--gradient_accumulation_steps', '2', '--gradient_checkpointing', 'True', '--per_device_eval_batch_size', '32', '--dataloader_num_workers', '8', '--logging_strategy', 'steps', '--logging_steps', '25', '--evaluation_strategy', 'epoch', '--save_strategy', 'epoch', '--load_best_model_at_end', 'True', '--metric_for_best_model', 'accuracy', '--seed', '0', '--freeze_feature_encoder', 'False', '--push_to_hub'] exits with return code = 1
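The build in the log above stops when nvcc reports "No supported gcc/g++ host compiler found". A minimal sketch for checking which build tools are visible on PATH, to be run in the same environment and on the same node as the training job. The helper name find_build_tools and the default tool list are assumptions for illustration; finding a compiler on PATH does not by itself show that nvcc supports that compiler version.

```python
import shutil


def find_build_tools(tools=("gcc", "g++", "nvcc", "ninja")):
    """Return a dict mapping each tool name to its path on PATH, or None."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        # A missing gcc or g++ here would be consistent with the nvcc error above.
        print(f"{tool}: {path or 'not found on PATH'}")
```

On a SLURM cluster the login node and the GPU nodes may expose different toolchains, so the output can differ depending on where the script runs.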

sanchit-gandhi

Owner Jul 10, 2023

I managed to do this by just installing DeepSpeed from the most recent PyPI release. Could you try doing it this way rather than building from source?

fkov

Jul 12, 2023

I suspect DeepSpeed is not working because I am running on an on-premise SLURM cluster. @sanchit-gandhi, is there a guide on installing DeepSpeed in a Conda environment on a cluster login node and then calling DeepSpeed on the GPU nodes from a bash script?

So far I have installed DeepSpeed on the login node (which has no GPU) and then run a bash script from that environment on the login node. In the bash script I specify the GPU resources and the classification.py script to run for fine-tuning.

fkov

Jul 12, 2023

@sanchit-gandhi, I found the solution: I had to run DS_BUILD_OPS=0 pip install deepspeed on the login node.
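The reported fix sets the DS_BUILD_OPS environment variable to 0 for the pip install, which asks DeepSpeed not to pre-compile its C++/CUDA ops at install time. The sketch below shows the same install driven from Python, which can be handy inside a setup script. The function names build_install_command and install are hypothetical, and whether this resolves a given cluster setup is an assumption that depends on the toolchain available on the compute nodes.

```python
import os
import subprocess
import sys


def build_install_command(package="deepspeed", build_ops="0"):
    """Return (argv, env) for a pip install with DS_BUILD_OPS set."""
    env = dict(os.environ)
    # DS_BUILD_OPS=0 mirrors the command reported in the thread.
    env["DS_BUILD_OPS"] = build_ops
    argv = [sys.executable, "-m", "pip", "install", package]
    return argv, env


def install(package="deepspeed"):
    """Run the install in the current interpreter's environment."""
    argv, env = build_install_command(package)
    subprocess.run(argv, env=env, check=True)

# Usage (performs a real install, so it is left commented out):
# install()
```

Using sys.executable with -m pip keeps the install tied to the active Conda environment rather than whichever pip happens to be first on PATH.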
