Memory Limits?

#18
by bmoviepop - opened

I tried training a v2 512 person model with 19 images and it ran out of memory on both the T4 Small and Medium. Here is the error. Any help?

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 14.76 GiB total capacity; 13.46 GiB already allocated; 43.75 MiB free; 13.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

bmoviepop changed discussion title from Space Limits? to Memory Limits?
bmoviepop changed discussion status to closed
bmoviepop changed discussion status to open

In the other CUDA thread the poster said it was 'Resolved with restart and rebuild' - a factory restart did not fix the error for me.

You could technically set PYTORCH_CUDA_ALLOC_CONF as a repo secret in settings. "Secrets are exposed as environment variables...".
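For reference, that variable only takes effect if it is present before the process first touches CUDA memory. A minimal sketch, assuming the Space simply exposes the repo secret as a plain environment variable (the max_split_size_mb:128 value is just an example, not a recommendation):

```python
# Illustrative sketch only: PYTORCH_CUDA_ALLOC_CONF has to be in the environment
# before the CUDA caching allocator is first used, so set it before importing torch
# (or rely on the Space exposing the repo secret as an env var at startup).
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")  # arbitrary example value

import torch  # imported after the variable is set so the allocator picks it up
```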

Hi, did you fix this error? I also tried a factory restart, but it didn't work.

I faced the same issue. I tried T4 Medium for more memory but it still isn't working and gives me out of memory. A factory restart did not fix the error for me either.

T4 medium with 1 person at 10 images, each image at most 512px in the major axis. Still getting out of memory errors post-factory reset. No solution as of yet

Caching latents: 100%|██████████| 10/10 [00:02<00:00,  3.35it/s]

  0%|          | 0/1500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 292, in run_predict
    output = await app.blocks.process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1007, in process_api
    result = await self.call_function(fn_index, inputs, iterator, request)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 848, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "app.py", line 243, in train
    run_training(args_general)
  File "/home/user/app/train_dreambooth.py", line 763, in run_training
    accelerator.backward(loss)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/accelerate/accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 1.58 GiB (GPU 0; 14.76 GiB total capacity; 11.81 GiB already allocated; 587.75 MiB free; 13.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here's a more detailed stack trace, this time with 10 256px square images in person mode. It seems to fail immediately when training starts, as if it's failing to even load the model.

using diffusers 0.10.0 fixed the issue for me.

Seems like they changed something (made it worse) in the last update ;(

What exactly did you change?

@abesmon Seems like it calls for 0.6.0 in the code? Let us know what you changed/what file.

I changed to diffusers 0.10 but I get the same error; interestingly, the error is not reproduced in 2.7 SD.

I just forced it to install 0.10.0 using pip install --upgrade diffusers==0.10.0 before running the application.

I think you can achieve the same result by changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 in the requirements.txt
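For anyone editing it through the web interface, the change in requirements.txt is roughly this one-line swap (both lines taken from the posts above):

```
# requirements.txt - before (pulls bleeding-edge diffusers from git)
git+https://github.com/huggingface/diffusers.git

# requirements.txt - after (pinned release)
diffusers==0.10.0
```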

Maybe not training the text encoder can avoid the OOM, but you will get a result which is not as good.
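A rough Python sketch of why that helps, assuming the Space's trainer follows the standard diffusers DreamBooth example (the model id and learning rate below are placeholders, not the Space's actual settings): with the text encoder frozen, its gradients and optimizer state never land on the GPU, so only the UNet is involved in the backward pass that is running out of memory.

```python
# Hypothetical sketch, not the Space's actual code: freeze the text encoder so it
# contributes no gradients or optimizer state, and only optimize the UNet.
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

model_id = "stabilityai/stable-diffusion-2-1-base"  # placeholder base model
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

text_encoder.requires_grad_(False)    # frozen: no gradients, no Adam state on the GPU
unet.enable_gradient_checkpointing()  # the checkpointing already visible in the traceback above
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)  # UNet parameters only
```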

I wonder if this is more related to something about the images being used, as I made 10 models successfully and then randomly have been getting this error consistently on the same build. Sadly I don't have the source images from my previous models or else I'd try one of those again.

I think not. I have trained successfully with this Space on different numbers of images for a long time and only started getting those errors yesterday - the same day diffusers updated to 0.10.2.

Confirmed on my end that changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 in requirements.txt works. Thanks!

I've been training models all day and it's been working fine, until just now. Now I keep getting the CUDA error whatever I try :(

edit: Can also confirm that changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 works.

diffusers==0.10.0 also worked for me :)

Sorry all, yes, changing the requirements.txt (Python package dependencies) for diffusers from pulling the git repo directly (bleeding edge) to a tagged and packaged version such as 0.10 will always be more stable, and in this case it works. I always recommend pinning the version so you're not pulling bleeding-edge code that will most likely break from time to time, especially since other dependencies may need to be updated as well, in lockstep.

Dumb question, how do I change git+https://github.com/huggingface/diffusers.git via the website interface?

Nevermind, didn't realize it was a literal swap. :-)

Hey everyone - this should be working again with all versions of diffusers :)
(You can just duplicate the Space again and get it working)

multimodalart changed discussion status to closed

It still does not work for me. :-(

I too am still experiencing this issue constantly. Have tried a variety of people and objects with 10-20 512x512 photos each and always get a CUDA out of memory error shortly after beginning. Tried on T4 small, medium and A10 small. Tried using 1.5, 2.1-512 and 2.1-768. Tried to force diffusers 0.10.0 as mentioned above and it still errors out. I factory reboot the space each time after the error.

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 14.76 GiB total capacity; 13.38 GiB already allocated; 45.75 MiB free; 13.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any ideas, @multimodalart ? Are the images I'm using to blame?
