ERROR!

#1 opened by kst23
(heybuddy) root@LUCIFER:~/nova-voice-modules# heybuddy train "hey nova"
Generating features:   0%|                                                                                                         | 0/4 [00:00<?, ?batch/s]2024-11-23 14:17:13,395 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
2024-11-23 14:17:13,395 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Generating adversarial samples: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 25000/25000 [16:59<00:00, 24.53it/s]
Loading dataset shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 19/19 [00:00<00:00, 133.60it/s]
Resolving data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81/81 [00:00<00:00, 130.72it/s]
Resolving data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81/81 [00:00<00:00, 231587.34it/s]
Loading dataset shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 75/75 [00:06<00:00, 11.23it/s]
2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:115) Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
2024-11-23 14:35:18,801 [speechbrain.utils.quirks] INFO (quirks.py:120) Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
Generating augmented adversarial samples: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 25000/25000 [06:37<00:00, 62.89it/s]
2024-11-23 14:41:58.226262672 [E:onnxruntime:Default, provider_bridge_ort.cc:1862 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /lib/x86_64-linux-gnu/libcudnn_ops.so.9: undefined symbol: _ZN5cudnn5graph13LibraryLoader11getInstanceEv, version libcudnn_graph.so.9
2024-11-23 14:41:58.241383181 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:993 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
2024-11-23 14:42:19.172005032 [E:onnxruntime:Default, provider_bridge_ort.cc:1862 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /lib/x86_64-linux-gnu/libcudnn_ops.so.9: undefined symbol: _ZN5cudnn5graph13LibraryLoader11getInstanceEv, version libcudnn_graph.so.9

2024-11-23 14:42:19.172042341 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:993 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
2024-11-23 14:53:24,858 [heybuddy] WARNING (embeddings.py:220) Replacing 78 NaN embeddings with random embeddings.
Generating adversarial embeddings: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 400000/400000 [11:30<00:00, 579.07it/s]
Generating features:  25%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž                                                                     | 1/4 [36:13<1:48:40, 2173.62s/batch]
2024-11-23 14:53:27,609 [datasets] INFO (config.py:54) PyTorch version 2.5.1+cu124 available.
/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Generating adversarial samples: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 25000/25000 [15:07<00:00, 27.55it/s]
Generating features:  25%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž                                                                     | 1/4 [51:50<2:35:32, 3110.86s/batch]
Traceback (most recent call last):
  File "/root/anaconda3/envs/heybuddy/bin/heybuddy", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/__main__.py", line 345, in train
    training, validation, testing = WakeWordTrainingDatasetIterator.all(
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/training.py", line 814, in all
    training = cls.default(
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/training.py", line 364, in default
    positive_features, adversarial_features = TrainingFeaturesGenerator.get_training_features(
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/features.py", line 818, in get_training_features
    adversarial_features = adversarial(
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/site-packages/heybuddy/dataset/features.py", line 532, in __call__
    sample_feature_batches.append(future.result())
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/root/anaconda3/envs/heybuddy/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I have been trying to make it run for the past 5 days now, and it keeps throwing this same error.
I have tried it on:

  • My own device
  • Google Colab
  • a 2xA100 VM from Azure
  • an H100 VM from Azure

The issue is the same on every one of them, and I don't know why. Could you please shed some light on it?

Hello, thank you for the report.

A simple test on Google Colab does not reproduce this issue for me - see here: https://colab.research.google.com/drive/17sJ9cMTCsBZPpK3EWVd0S9pAyLVV-r6h?usp=sharing. It's worth noting that I used smaller datasets for training due to Colab's limited file storage space.

One thing I see in your output is that it couldn't create a CUDAExecutionProvider; you should be able to rectify that with `pip install onnxruntime-gpu`, which should also make a lot of steps go much faster. In theory HeyBuddy should work without it, but in practice I've always tested with it installed, so it's possible some bugs slipped through.
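
For reference, here's a quick standalone check (plain onnxruntime, not part of HeyBuddy) to confirm the GPU provider is actually usable after installing it:

```python
# Quick sanity check: can onnxruntime see the CUDA execution provider?
import onnxruntime as ort

print(ort.get_available_providers())
# You want "CUDAExecutionProvider" in that list. If only "CPUExecutionProvider"
# shows up, the GPU build isn't being picked up - e.g. the CPU-only onnxruntime
# package is shadowing onnxruntime-gpu, or the cuDNN 9 / CUDA 12 libraries
# aren't on the loader path.
```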

Could you please check and make sure you have disk space remaining? If so, could you monitor your CPU RAM to see if you're getting hit by the OOM killer? HeyBuddy tries to balance batch size against your available system resources, but it's possible it's miscalculating and the worker processes are being killed by your OS.
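
If it helps, something like this (a hypothetical little helper, assuming you have `psutil` installed) can run in a second terminal and log system memory once a second, so you can see whether RAM climbs toward 100% right before the crash:

```python
# monitor_mem.py - hypothetical helper: print system memory usage once a second
# so you can see whether RAM spikes right before the training process dies.
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    print(f"used={vm.used / 1e9:6.1f} GB  available={vm.available / 1e9:6.1f} GB  ({vm.percent}%)")
    time.sleep(1)
```

If a worker did get killed, the kernel log will usually show an out-of-memory entry for it afterwards as well.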

Thanks again. I'm going to keep trying to reproduce the same error on my end.

Hey @benjamin-paine , thanks for the response.
I did have `onnxruntime-gpu` installed, and there is definitely enough storage as well.

When this error first occurred, I thought the problem might be the CUDA Toolkit version, so I changed that, but the same issue persists!
My VM has about 2 TB of storage, 220 GB of RAM, and 93 GB of VRAM, so I don't think compute would be an issue.

Cheers ✌️

@kst23 Definitely not a compute problem, then; you've got quite a bit more firepower than I had when I was developing it.

It's interesting that it says it can't create a CUDA provider even though you have the prerequisites installed; there may be something else going on with the GPU setup on that particular machine. Though since you say you've encountered the issue across multiple machines, I'm wondering if it's something more systemic.

Could you try running it with the `--debug` flag to see if there's any additional output that points to what's wrong?
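
For context, `BrokenProcessPool` itself doesn't carry the real cause: it's what `concurrent.futures` raises when a worker process dies without ever getting the chance to raise a Python exception, which is exactly what a SIGKILL from the OOM killer looks like. A minimal illustration (not HeyBuddy's own code):

```python
# Minimal illustration (not HeyBuddy code): if a worker process is killed
# outright, the parent only ever sees BrokenProcessPool, never the real cause.
import os
import signal
from concurrent.futures import ProcessPoolExecutor

def die_abruptly():
    # Simulates what the OOM killer does: the worker is killed with no
    # opportunity to raise or report a Python exception.
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(die_abruptly)
        future.result()  # raises concurrent.futures.process.BrokenProcessPool
```

So the traceback above mostly tells us that one of the feature-generation workers died abruptly; the debug output (or the memory log) should help narrow down why.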

Hey @benjamin-paine , thanks for the response!
I'll run it with the `--debug` flag and let you know the outcome.

Thanks once again.
I do find this project quite interesting... Cheers ✌️
