CUDA out of memory?

#8
by Apotrox - opened

Hello all,

I am pretty new to the entire topic of neural networks and everything surrounding it, so pardon me if I make any rookie mistakes or ask stupid questions.

After a lot of trouble getting all the dependencies for threestudio built and installed, I followed the instructions on the Model Card and tried the "basic" command given there. My GPU (3080 10GB) immediately jumps to 100%, the model gets loaded and "Epoch 0" starts, only to fail after a minute or two because memory runs out.

Here's the full error message: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacty of 10.00 GiB of which 0 bytes is free. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 526.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

A few questions here:

  • How did it allocate roughly 22 GB even though my GPU only has 10 GB? Does system RAM count toward that?
  • Would changing that allocation fix the error, even if it slows down generation? If so, where would the PyTorch config to change it be? (My best guess is the environment-variable sketch right after this list.)
  • I saw that there is the keyword "train" in the command. Does it actually train the NN, or does it make the NN produce 3D images/meshes/files from the 2D images provided? If it is only for training, would I have to take the ckpt (or the safetensors), like the Stable Diffusion one, and put it in some kind of WebUI?
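
For the second question, my best guess from the error text is the allocator environment variable it mentions, set before launching. A sketch of what I mean, with an arbitrary 128 MiB split size and the bundled hamburger sample image standing in for whatever image is actually used; I don't know yet whether this really helps on a 10 GB card:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python launch.py --config configs/stable-zero123.yaml --train --gpu 0 data.image_path=./load/images/hamburger_rgba.png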

I hope someone here can point me in the right direction :)

I will give a better answer later on, but I highly recommend that you run it in the cloud, in Docker, or in a VM, because I am still struggling with some models locally. Anyway, I have had every error you can imagine and never got the one you have. RAM doesn't count for that much, since I have 4 GB of RAM and it works properly for me. Until I can get into your questions, I recommend you try to deploy it on huggingface.co, GitHub, or Docker. For huggingface.co you must use Streamlit; that works well, but many models are still complicated for me to deploy locally on a PC. Also check out the model itself, because some of them can only be run private since they don't have limitations ;) I'll answer the rest later, but just try what I told you and let me know. Thanks for your question!

Oh, I almost forgot: if you are in a virtual machine, just set 2024 for RAM and 128 for video memory.

One other thing: go to the repo on GitHub and fork it, then try to build and deploy.

From your repo on GitHub you can deploy to Hugging Face, running with Streamlit.

Thank you very much for your answer. It appears that deploying on Hugging Face or any other cloud service requires payment, which is exactly what I'm trying NOT to do by running the NN locally. I will try to set it all up again from scratch. Will keep this updated whenever I find something :)


For what it's worth, I managed to get it running somehow with a 4070 Ti (12 GB VRAM) on Ubuntu 22.04 LTS with CUDA 11.8, following the tutorials and with the following change in stable-zero123.yaml (copied to stable-zero123_custom.yaml):

line 21 changed to: "batch_size: [3, 2, 1]"
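
For anyone following along, the copy itself can be made with something like this, assuming the working directory is the threestudio checkout (line 21 of the copy is then edited by hand to the value above):
cp configs/stable-zero123.yaml configs/stable-zero123_custom.yaml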

then running with: python launch.py --config configs/stable-zero123_custom.yaml --train --gpu 0 system.cleanup_after_validation_step=true system.cleanup_after_test_step=true system.renderer.num_samples_per_ray=128 data.width=128 data.height=128

It's not beautiful, but it works and gives a recognizable output.

I can then get an .obj file with this call (adapt directories to the created directories):
python launch.py --config ./outputs/zero123-sai/[64,\ 128,\ 256]_hamburger_rgba.png@2024/configs/parsed.yaml --export --gpu 0 resume=./outputs/zero123-sai/[64,\ 128,\ 256]_hamburger_rgba.png@2024/ckpts/last.ckpt system.exporter_type=mesh-exporter

Works in Blender:
[screenshot from 2024-01-07: the exported mesh opened in Blender]

Same steps for a custom image (adapt directories):
python launch.py --config configs/stable-zero123_custom.yaml --train --gpu 0 data.image_path=./load/images/dragon-37_rgba.png system.cleanup_after_validation_step=true system.cleanup_after_test_step=true system.renderer.num_samples_per_ray=128 data.width=128 data.height=128

python launch.py --config ./outputs/zero123-sai/[64,\ 128,\ 256]_dragon-37_rgba.png@2024/configs/parsed.yaml --export --gpu 0 resume=./outputs/zero123-sai/[64,\ 128,\ 256]_dragon-37_rgba.png@2024/ckpts/last.ckpt system.exporter_type=mesh-exporter

Please let us/me know if there is a better way.

Hey everyone! I apparently had success with the installation process, with no errors. However, I'm getting the following output when I start the training process, and I have no idea why it ends with a "Killed" status. Any ideas?

python launch.py --config configs/stable-zero123.yaml --train --gpu 0 system.cleanup_after_validation_step=true system.cleanup_after_test_step=true system.renderer.num_samples_per_ray=128 data.width=128 data.height=128 data.image_path=./load/images/hamburger_rgba.png
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/mediapipe_face/mediapipe_face_common.py:7: UserWarning: The module 'mediapipe' is not installed. The package will have limited functionality. Please install it using the command: pip install 'mediapipe'
warnings.warn(
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/segment_anything/modeling/tiny_vit_sam.py:654: UserWarning: Overwriting tiny_vit_5m_224 in registry with controlnet_aux.segment_anything.modeling.tiny_vit_sam.tiny_vit_5m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
return register_model(fn_wrapper)
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/segment_anything/modeling/tiny_vit_sam.py:654: UserWarning: Overwriting tiny_vit_11m_224 in registry with controlnet_aux.segment_anything.modeling.tiny_vit_sam.tiny_vit_11m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
return register_model(fn_wrapper)
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/segment_anything/modeling/tiny_vit_sam.py:654: UserWarning: Overwriting tiny_vit_21m_224 in registry with controlnet_aux.segment_anything.modeling.tiny_vit_sam.tiny_vit_21m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
return register_model(fn_wrapper)
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/segment_anything/modeling/tiny_vit_sam.py:654: UserWarning: Overwriting tiny_vit_21m_384 in registry with controlnet_aux.segment_anything.modeling.tiny_vit_sam.tiny_vit_21m_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
return register_model(fn_wrapper)
/mnt/d/threestudio/venv/lib/python3.10/site-packages/controlnet_aux/segment_anything/modeling/tiny_vit_sam.py:654: UserWarning: Overwriting tiny_vit_21m_512 in registry with controlnet_aux.segment_anything.modeling.tiny_vit_sam.tiny_vit_21m_512. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
return register_model(fn_wrapper)
Seed set to 0
[INFO] GPU available: True (cuda), used: True
[INFO] TPU available: False, using: 0 TPU cores
[INFO] IPU available: False, using: 0 IPUs
[INFO] HPU available: False, using: 0 HPUs
[INFO] You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/mnt/d/threestudio/threestudio/data/image.py:93: UserWarning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at ../aten/src/ATen/native/Cross.cpp:63.)
right: Float[Tensor, "1 3"] = F.normalize(torch.cross(lookat, up), dim=-1)
[WARNING] Ignoring resolution_milestones since height and width are not changing
[INFO] single image dataset: load image ./load/images/hamburger_rgba.png torch.Size([1, 128, 128, 3])
[WARNING] Ignoring resolution_milestones since height and width are not changing
[INFO] single image dataset: load image ./load/images/hamburger_rgba.png torch.Size([1, 128, 128, 3])
[INFO] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[INFO]
| Name | Type | Params

0 | geometry | ImplicitVolume | 12.6 M
1 | material | DiffuseWithPointLightMaterial | 0
2 | background | SolidColorBackground | 0
3 | renderer | NeRFVolumeRenderer | 0

12.6 M Trainable params
0 Non-trainable params
12.6 M Total params
50.450 Total estimated model params size (MB)
[INFO] Validation results will be saved to outputs/zero123-sai/[64, 128, 256]_hamburger_rgba.png@20240220-145735/save
[INFO] Loading Stable Zero123 ...
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.53 M params.
Killed
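
(Side note, in case it helps narrow this down: a bare "Killed" with no Python traceback usually means the Linux out-of-memory killer ended the process because system RAM, not VRAM, ran out, which is fairly common under WSL when its memory limit is low. If that is what happened, the kernel log should record it, for example:
sudo dmesg | grep -iE "out of memory|killed process"
Take this as a guess at the cause, not a confirmed diagnosis.)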
