🚩 Report: Not working
Getting a "file not found" error for pytorch.bin. HF tells me I ran out of CUDA memory, but I tried both the 16 GB and 30 GB Nvidia T4 GPU tiers and got the same error. The logs show it's a file-not-found error. Testing with an A10G (someone in the discussions claimed it worked for them when using one) and will report back.
I get the following error using an A10G small.
Enabling memory efficient attention with xformers...
Could not enable memory efficient attention. Make sure xformers is installed correctly and a GPU is available: No operator found for memory_efficient_attention_forward
with inputs:
query : shape=(1, 2, 1, 40) (torch.float32)
key : shape=(1, 2, 1, 40) (torch.float32)
value : shape=(1, 2, 1, 40) (torch.float32)
attn_bias : <class 'NoneType'>
p : 0.0
`flshattF` is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
`tritonflashattF` is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
requires A100 GPU
`cutlassF` is not supported because:
xFormers wasn't build with CUDA support
`smallkF` is not supported because:
xFormers wasn't build with CUDA support
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 40
/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to from_config
.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
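For what it's worth, the log above points at the xFormers part of the problem: every backend rejects the torch.float32 inputs, since the kernels only accept half precision. A minimal sketch of checking dtype compatibility before enabling the optimization (the helper name is my own, and the commented-out pipeline call assumes the diffusers `from_pretrained(..., torch_dtype=...)` API with an illustrative model id):

```python
import torch

# The backends named in the log (flshattF, tritonflashattF) reject
# torch.float32; they only accept half-precision inputs.
XFORMERS_ATTN_DTYPES = {torch.float16, torch.bfloat16}

def can_use_xformers_attention(dtype: torch.dtype) -> bool:
    """Return True if xFormers memory-efficient attention accepts this dtype."""
    return dtype in XFORMERS_ATTN_DTYPES

# Loading the pipeline in fp16 would satisfy the check, e.g. (hypothetical id):
# pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# pipe.enable_xformers_memory_efficient_attention()
```

So a Space that loads weights in fp32 would hit exactly this warning, even with xFormers installed correctly.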
Nothing happened for 10 minutes so I ended the space.
I also get an error when training with a T4 small. However, I have no error message, as the log had already been cleared by the time I went to check it. This has happened three times in a row now.
Will now test with a T4 medium.
Ok, I was able to get a stack trace this time (on a T4 medium):
/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to from_config
.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
thierry thierry thierry Adding Safety Checker to the model...
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/routes.py", line 337, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/blocks.py", line 1015, in process_api
result = await self.call_function(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/blocks.py", line 833, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/helpers.py", line 584, in tracked_fn
response = fn(*args)
File "/home/user/app/app.py", line 344, in train
push(model_name, where_to_upload, hf_token, which_model, True)
File "/home/user/app/app.py", line 364, in push
convert("output_model", "model.ckpt")
File "/home/user/app/convertosd.py", line 270, in convert
unet_state_dict = torch.load(unet_path, map_location="cpu")
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/unet/diffusion_pytorch_model.bin'
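The traceback shows the converter calling torch.load on a hard-coded .bin path. A hedged sketch (the helper name and the safetensors fallback are my own, not from the app's convertosd.py) of resolving the UNet weight file before loading, since newer diffusers versions may save `.safetensors` instead of `.bin`:

```python
import os

def resolve_unet_weights(output_dir: str) -> str:
    """Find the UNet weight file the trainer actually wrote.

    A safetensors-only save would explain the FileNotFoundError
    on the hard-coded .bin path.
    """
    candidates = [
        os.path.join(output_dir, "unet", "diffusion_pytorch_model.bin"),
        os.path.join(output_dir, "unet", "diffusion_pytorch_model.safetensors"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(
        f"No UNet weights found; looked for: {', '.join(candidates)}"
    )
```

If the .bin file is genuinely missing (e.g. training was killed by an OOM before the save finished), this would at least fail with a message naming both paths it tried.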
Ok, my problem may be this: https://huggingface.co/spaces/multimodalart/dreambooth-training/discussions/88
Yes, I was referencing this post in my second comment. Please do let me know if the proposed solution of using an A10G works for you, as I couldn't get it working.
Actually, I think it may work with an A10G. I saw that, just like yours, the container log seemed to be stuck for ages. But then it eventually continued! Unfortunately, my Space was shut down because of my inactivity settings. I will try again with a longer inactivity timeout.
However, I also noticed that there were a bunch of "CUDA Out of memory" errors in my notification inbox. But I am unsure which of my trial runs they belong to.
I'll let you know once I've made my next attempt... Should be shortly.
Oh my god, yes, it took ages, but it is indeed working with an A10G! I set my sleep timeout to 10 hours and it worked!
Thank you, I'll close this issue since it worked out for you.