ERROR!
(heybuddy) root@LUCIFER:~/nova-voice-modules# heybuddy train "hey nova"
Generating features: 0%| | 0/4 [00:00<?, ?batch/s]2024-11-23 14:17:13,395 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
2024-11-23 14:17:13,395 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
Generating adversarial samples: 100%|████████████████████████████████████████████████████████████████████████████████| 25000/25000 [16:59<00:00, 24.53it/s]
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 133.60it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:00<00:00, 130.72it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:00<00:00, 231587.34it/s]
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:06<00:00, 11.23it/s]
Loading dataset shards:  97%|██████████████████████████████████████████████████████████████████████████████████▎   | 73/75 [00:06<00:00, 14.98it/s]2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:115) Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:115) Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:120) Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:120) Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
Generating augmented adversarial samples: 100%|██████████████████████████████████████████████████████████████████████| 25000/25000 [06:37<00:00, 62.89it/s]
Generating augmented adversarial samples: 100%|██████████████████████████████████████████████████████████████████████| 24973/25000 [06:37<00:00, 56.23it/s]2024-11-23 14:41:58.226262672 [E:onnxruntime:Default, provider_bridge_ort.cc:1862 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /lib/x86_64-linux-gnu/libcudnn_ops.so.9: undefined symbol: _ZN5cudnn5graph13LibraryLoader11getInstanceEv, version libcudnn_graph.so.9
2024-11-23 14:41:58.241383181 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:993 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
2024-11-23 14:42:19.172005032 [E:onnxruntime:Default, provider_bridge_ort.cc:1862 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /lib/x86_64-linux-gnu/libcudnn_ops.so.9: undefined symbol: _ZN5cudnn5graph13LibraryLoader11getInstanceEv, version libcudnn_graph.so.9
2024-11-23 14:42:19.172042341 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:993 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
2024-11-23 14:53:24,858 [heybuddy] WARNING (embeddings.py:220) Replacing 78 NaN embeddings with random embeddings.
Generating adversarial embeddings: 100%|███████████████████████████████████████████████████████████████████████████| 400000/400000 [11:30<00:00, 579.07it/s]
Generating features:  25%|████████████████████████                                                                  | 1/4 [36:13<1:48:40, 2173.62s/batch]2024-11-23 14:53:27,609 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
2024-11-23 14:53:27,609 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
Generating adversarial samples: 100%|████████████████████████████████████████████████████████████████████████████████| 25000/25000 [15:07<00:00, 27.55it/s]
Generating features:  25%|████████████████████████                                                                  | 1/4 [51:50<2:35:32, 3110.86s/batch]
Traceback (most recent call last):
File "/root/anaconda3/envs/heybuddy/bin/heybuddy", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/__main__.py", line 345, in train
training, validation, testing = WakeWordTrainingDatasetIterator.all(
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/training.py", line 814, in all
training = cls.default(
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/training.py", line 364, in default
positive_features, adversarial_features = TrainingFeaturesGenerator.get_training_features(
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/features.py", line 818, in get_training_features
adversarial_features = adversarial(
File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/features.py", line 532, in __call__
sample_feature_batches.append(future.result())
File "/root/anaconda3/envs/heybuddy/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/root/anaconda3/envs/heybuddy/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I have been trying to make it run for the past 5 days now and it keeps throwing this same error.
Tried it on:
- My own device
- Google Colab
- A 2xA100 VM from Azure
- An H100 VM from Azure
The issue is the same on every machine and I don't know why. Could you please shed some light on it?
Hello, thank you for the report.
A simple test on Google Colab does not reproduce this issue for me - see here: https://colab.research.google.com/drive/17sJ9cMTCsBZPpK3EWVd0S9pAyLVV-r6h?usp=sharing. It's worth noting that I used smaller datasets for training due to Colab's limited file storage space.
One thing I see in your output is that it couldn't create a CUDAExecutionProvider - you should be able to rectify that issue with `pip install onnxruntime-gpu`. That should make a lot of steps go much faster. In theory HeyBuddy should work without it, but in practice I've always tested with it installed, so it's possible some bugs got through.
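If you want to confirm the GPU provider is actually usable after installing it, here's a minimal sketch - not part of HeyBuddy itself, and `model.onnx` is just a placeholder for any small ONNX model you have on hand - that forces ONNX Runtime to load its CUDA provider, which is exactly the step that's failing in your log:

```python
# Minimal check (assumes onnxruntime-gpu is installed) that the CUDA
# execution provider can be created at all.
import onnxruntime as ort

print("onnxruntime:", ort.__version__)
print("available providers:", ort.get_available_providers())

# Building a session is what actually loads libonnxruntime_providers_cuda.so,
# so this should reproduce the cuDNN errors from your log if something is mismatched.
# "model.onnx" is a placeholder path to any small ONNX model.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("session providers:", session.get_providers())  # CUDA listed first = GPU is active
```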
Could you please check and make sure you have disk space remaining? If so, could you monitor your CPU RAM to see if you're getting hit by the OOM killer? HeyBuddy tries to balance batch size against your available system resources, but it's possible it's miscalculating and the worker processes are being killed by your OS.
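A rough way to watch for that - this sketch assumes `psutil` is installed, which isn't a HeyBuddy requirement - is to run something like the following in a second terminal while training; if available RAM collapses right before the BrokenProcessPool error, the OOM killer is the likely cause. On most Linux systems you can also check `dmesg` for "Out of memory" messages after a crash.

```python
# Simple memory watcher (assumes: pip install psutil). Run alongside training
# and watch whether available RAM drops toward zero before the workers die.
import time

import psutil

while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(
        f"RAM used: {mem.percent:5.1f}% | "
        f"available: {mem.available / 1e9:7.1f} GB | "
        f"swap used: {swap.percent:5.1f}%",
        flush=True,
    )
    time.sleep(5)
```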
Thanks again. I'm going to keep trying to see if I can reproduce the same error myself.
Hey @benjamin-paine, thanks for the response.
I did have `onnxruntime-gpu` installed and there is definitely enough storage as well.
When this error first occurred, I thought the problem was the `CUDAToolkit` version, so I changed that, but the same issue persists!
In my VM I have about 2 TB of storage, 220 GB of RAM, and 93 GB of VRAM, so I don't think compute would be an issue.
Cheers!
@kst23 definitely not a compute problem then - you've got quite a bit more firepower than I had when I was developing it.
It's interesting that it says it can't create a CUDA provider even though you have the prerequisites installed - there may be something else wrong with the GPU setup on that particular machine. Though since you've encountered the issue across multiple machines, I'm wondering if it's something more systemic.
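The undefined-symbol error in your log (`libcudnn_ops.so.9` expecting a symbol from `libcudnn_graph.so.9`) often points to mismatched cuDNN 9 libraries being picked up at load time. As a quick, non-HeyBuddy-specific sanity check, you could run something like this in the same environment and compare the versions against the cuDNN libraries actually installed on the machine:

```python
# Rough diagnostic sketch (not HeyBuddy code): print the CUDA/cuDNN versions
# each framework was built against and whether the GPU is visible at all.
# A mismatch with the system libcudnn*.so.9 files could explain the
# undefined-symbol error when ONNX Runtime tries to load its CUDA provider.
import torch
import onnxruntime as ort

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("torch cuDNN version:", torch.backends.cudnn.version())
print("torch sees a GPU:", torch.cuda.is_available())
print("onnxruntime:", ort.__version__)
print("onnxruntime providers:", ort.get_available_providers())
```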
Could you try running it with `--debug` to see if there's any additional output that might point out what's wrong?
Hey @benjamin-paine, thanks for the response!
I'll try running it with the `--debug` flag and let you know the outcome.
Thanks once again.
I do find this project quite interesting... Cheers!