Memory Limits?

#18
by bmoviepop - opened

I tried training a v2 512 person model with 19 images and it ran out of memory on both the T4 Small and Medium. Here is the error. Any help?

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 14.76 GiB total capacity; 13.46 GiB already allocated; 43.75 MiB free; 13.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

bmoviepop changed discussion title from Space Limits? to Memory Limits?
bmoviepop changed discussion status to closed
bmoviepop changed discussion status to open

In the other CUDA thread the poster said it was 'Resolved with restart and rebuild' - a factory restart did not fix the error for me.

You could technically set PYTORCH_CUDA_ALLOC_CONF as a repo secret in settings. "Secrets are exposed as environment variables...".
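For reference, that variable only takes effect if it is present before the process first touches CUDA memory. A minimal sketch, assuming the Space simply exposes the repo secret as a plain environment variable (the max_split_size_mb:128 value is just an example, not a recommendation):

```python
# Illustrative sketch only: PYTORCH_CUDA_ALLOC_CONF has to be in the environment
# before the CUDA caching allocator is first used, so set it before importing torch
# (or rely on the Space exposing the repo secret as an env var at startup).
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")  # arbitrary example value

import torch  # imported after the variable is set so the allocator picks it up
```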

Hi, did you fix this error? I also tried a factory restart, but it didn't work.

I faced the same issue. I tried T4 Medium for more memory but it still isn't working and gives me out of memory. A factory restart did not fix the error for me either.

T4 medium with 1 person at 10 images, each image at most 512px in the major axis. Still getting out of memory errors post-factory reset. No solution as of yet

Caching latents: 100%|██████████| 10/10 [00:02<00:00,  3.35it/s]

  0%|          | 0/1500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 292, in run_predict
    output = await app.blocks.process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1007, in process_api
    result = await self.call_function(fn_index, inputs, iterator, request)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 848, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "app.py", line 243, in train
    run_training(args_general)
  File "/home/user/app/train_dreambooth.py", line 763, in run_training
    accelerator.backward(loss)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/accelerate/accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 1.58 GiB (GPU 0; 14.76 GiB total capacity; 11.81 GiB already allocated; 587.75 MiB free; 13.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here's a more detailed stack trace, this time with 10 256px square images in person mode. It seems to fail immediately when training starts, as if it's failing to even load the model.

using diffusers 0.10.0 fixed the issue for me.

Seems like they changed something (made it worse) in the last update ;(

What exactly did you change?

@abesmon Seems like it calls for 0.6.0 in the code? Let us know what you changed/what file.

I changed to diffusers 0.10 but I get the same error; interestingly, the error is not reproduced in 2.7 SD.

I just forced it to install 0.10.0 using pip install --upgrade diffusers==0.10.0 before running the application.

I think you can achieve the same result by changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 in the requirements.txt
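For anyone editing it through the web interface, the change in requirements.txt is roughly this one-line swap (both lines taken from the posts above):

```
# requirements.txt - before (pulls bleeding-edge diffusers from git)
git+https://github.com/huggingface/diffusers.git

# requirements.txt - after (pinned release)
diffusers==0.10.0
```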

Maybe not training the text encoder can avoid the OOM, but you will get a result which is not as good.
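A rough Python sketch of why that helps, assuming the Space's trainer follows the standard diffusers DreamBooth example (the model id and learning rate below are placeholders, not the Space's actual settings): with the text encoder frozen, its gradients and optimizer state never land on the GPU, so only the UNet is involved in the backward pass that is running out of memory.

```python
# Hypothetical sketch, not the Space's actual code: freeze the text encoder so it
# contributes no gradients or optimizer state, and only optimize the UNet.
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

model_id = "stabilityai/stable-diffusion-2-1-base"  # placeholder base model
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

text_encoder.requires_grad_(False)    # frozen: no gradients, no Adam state on the GPU
unet.enable_gradient_checkpointing()  # the checkpointing already visible in the traceback above
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)  # UNet parameters only
```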

I wonder if this is more related to something about the images being used, as I made 10 models successfully and then randomly have been getting this error consistently on the same build. Sadly I don't have the source images from my previous models or else I'd try one of those again.

I think not. I have trained successfully with this Space on different numbers of images for a long time and only started getting those errors yesterday - the same day diffusers updated to 0.10.2.

Confirmed on my end that changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 in requirements.txt works. Thanks!

I've been training models all day and it's been working fine, until just now. Now I keep getting the CUDA error whatever I try :(

edit: Can also confirm that changing git+https://github.com/huggingface/diffusers.git to diffusers==0.10.0 works.

diffusers==0.10.0 also worked for me :)

Sorry all, yes, changing the requirements.txt (Python package dependencies) for diffusers from pulling the git repo directly (bleeding edge) to a tagged and packaged version such as 0.10 will always be more stable, and in this case it works. I always recommend pinning the version so you're not pulling bleeding-edge code that will most likely break from time to time, especially since other dependencies may need to be updated as well, in lockstep.

Dumb question, how do I change git+https://github.com/huggingface/diffusers.git via the website interface?

Nevermind, didn't realize it was a literal swap. :-)

Hey everyone - this should be working again with all versions of diffusers :)
(You can just duplicate the Space again and get it working)

multimodalart changed discussion status to closed

It still does not work for me. :-(

I too am still experiencing this issue constantly. Have tried a variety of people and objects with 10-20 512x512 photos each and always get a CUDA out of memory error shortly after beginning. Tried on T4 small, medium and A10 small. Tried using 1.5, 2.1-512 and 2.1-768. Tried to force diffusers 0.10.0 as mentioned above and it still errors out. I factory reboot the space each time after the error.

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 14.76 GiB total capacity; 13.38 GiB already allocated; 45.75 MiB free; 13.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any ideas, @multimodalart ? Are the images I'm using to blame?
