RuntimeError: CUDA error: invalid argument

#55
by rcto - opened

I've churned through an evening and about $6 trying to get this to work, including trying multiple machine types (up to the highest-specced one) as well as different image sets and parameters.

Every time it fails with RuntimeError: CUDA error: invalid argument
(or, on subsequent runs, it fails with zero-memory errors).

Anyone know what's going on here?

Starting single training...
Namespace(Session_dir='', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adam_weight_decay=0.01, cache_latents=False, center_crop=False, class_data_dir=None, class_prompt='', dump_only_text_encoder=False, gradient_accumulation_steps=1, gradient_checkpointing=False, hub_model_id=None, hub_token=None, image_captions_filename=True, instance_data_dir='instance_images', instance_prompt='', learning_rate=2e-06, local_rank=-1, logging_dir='logs', lr_scheduler='polynomial', lr_warmup_steps=0, max_grad_norm=1.0, max_train_steps=1800, mixed_precision='fp16', num_class_images=100, num_train_epochs=1, output_dir='output_model', pretrained_model_name_or_path='/home/user/.cache/huggingface/hub/models--multimodalart--sd-fine-tunable/snapshots/9dabd4dbbdd4c72e2ffbc8fb4e28debef0254949', prior_loss_weight=1.0, push_to_hub=False, resolution=512, sample_batch_size=4, save_n_steps=0, save_starting_step=1, scale_lr=False, seed=42, stop_text_encoder_training=270, tokenizer_name=None, train_batch_size=1, train_only_unet=False, train_text_encoder=True, use_8bit_adam=True, with_prior_preservation=False)
Enabling memory efficient attention with xformers...
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)

  0%|          | 0/1800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 292, in run_predict
    output = await app.blocks.process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1007, in process_api
    result = await self.call_function(fn_index, inputs, iterator, request)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 848, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "app.py", line 264, in train
    run_training(args_general)
  File "/home/user/app/train_dreambooth.py", line 771, in run_training
    accelerator.backward(loss)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/accelerate/accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/xformers/ops/memory_efficient_attention.py", line 422, in backward
    ) = torch.ops.xformers.efficient_attention_backward_cutlass(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
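Following that hint, this is the kind of minimal smoke test I'd run on the Space's GPU (a sketch, assuming a CUDA device plus the torch and xformers builds from the traceback; tensor shapes are only illustrative):

```python
# Sketch: force synchronous CUDA error reporting, then exercise the xformers
# memory-efficient attention forward + backward pass, which is the op failing
# above (efficient_attention_backward_cutlass).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch
import xformers.ops as xops

# Small fp16 tensors in the [batch, seq_len, num_heads, head_dim] layout xformers accepts;
# head_dim=40 mirrors Stable Diffusion's attention heads but is otherwise arbitrary here.
q = torch.randn(1, 64, 8, 40, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

try:
    out = xops.memory_efficient_attention(q, k, v)
    out.sum().backward()
    print("xformers attention forward+backward OK on", torch.cuda.get_device_name(0))
except RuntimeError as err:
    print("xformers attention failed on this GPU/build:", err)
```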

I tried downgrading diffusers to 0.10.0, but it didn't change anything; I still get this invalid argument error.
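In case it helps with reproducing, the exact versions in play can be captured like this (a minimal sketch; it assumes the packages from the traceback are importable):

```python
# Print the library and CUDA versions relevant to this error, to include in reports.
import torch, diffusers, xformers

print("diffusers:", diffusers.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("xformers:", xformers.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```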

Can you share the settings you are using when you hit this issue? Quite a few users train their models successfully every day, so I would like to try to replicate your problem.

Thanks for your reply. Let me write down the steps as I'm currently trying again:

  • I give my duplicated Space A10G Small hardware
  • I restart the Space (cloned from your latest version, from 9 days ago; my requirements.txt modifications have been rolled back)
  • Build failed after 30 min (first.log)
  • I restart the Space manually
  • Build failed after 30 min (second.log)
  • I factory-reboot the Space
  • Build failed after 30 min (third.log)
  • I decide to switch to an A10G Large
  • I factory-reboot the Space
  • Build failed after 30 min (fourth.log)
  • I decide to delete my Space and duplicate a brand-new one
  • Initial build completed in 2 minutes with the default CPU Basic hardware (fifth.log)
  • I give this Space A10G Small hardware
  • I did not fill in the HUGGING_FACE_HUB_TOKEN secret, as I don't know whether it should be the Hugging Face write token
  • Finally, the build completes (after 28 minutes) (sixth.log)
  • I choose to train a person, based on 1.5
  • I upload 14 512x512 pictures (513 KB total); a sketch of how such images can be prepared is shown after this list
  • I name my concept Niko (as it's his name)
  • I don't use any custom settings
  • I name my model niko-1-5
  • I paste my Hugging Face Write Token
  • I click Start training
  • It failed after about a minute (seventh.log)
  • I give up, but before doing so I fall back to CPU Basic hardware to avoid extra fees (as the Space keeps running even though training crashed)
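As mentioned in the list, the pictures were already 512x512. For reference, here is a rough sketch of how images can be center-cropped and resized to that resolution before uploading (uses Pillow; the folder names are placeholders, with instance_images taken from the training logs):

```python
# Sketch: center-crop and resize source photos to 512x512 JPEGs,
# matching the resolution=512 setting shown in the training logs.
from pathlib import Path
from PIL import Image, ImageOps

def prepare_images(src_dir: str, dst_dir: str, size: int = 512) -> None:
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        img = Image.open(path).convert("RGB")
        img = ImageOps.fit(img, (size, size), Image.LANCZOS)  # center-crop + resize
        img.save(out / f"{path.stem}.jpg", quality=95)

prepare_images("raw_photos", "instance_images")  # hypothetical folder names
```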

Thank you so much for the detailed report.
Some of those issues were caused by instabilities in the Hugging Face Spaces management infrastructure, and others by a bug in the HF Training Space itself.

I pushed a fix for those bugs today, and mounting new Spaces should be more stable now. Also, if a CUDA error does occur, the GPU is now removed automatically. Sorry for the hassle, and feel free to try again.

Nice, thanks for your support!

I just tried again with a brand-new dreambooth-training Space. HF helped today and I was able to get the app running on an A10G Small very quickly, but training failed early.

Here is the container output:

/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py:1222: UserWarning: The default_enabled parameter of queue has no effect and will be removed in a future version of gradio.
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 337, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1015, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 833, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/helpers.py", line 584, in tracked_fn
    response = fn(*args)
  File "app.py", line 204, in train
    sleep_time = get_sleep_time(hf_token)
  File "app.py", line 185, in get_sleep_time
    return response.json()['runtime']['gcTimeout']
KeyError: 'gcTimeout'
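For reference, this KeyError comes from the unguarded dictionary lookup in get_sleep_time. A more defensive version could look roughly like this (a hedged sketch: the endpoint URL, the space_id parameter, and the default value are assumptions; only the 'runtime'/'gcTimeout' keys come from the traceback):

```python
# Hypothetical defensive rewrite of get_sleep_time: fall back to a default when the
# Spaces runtime API response has no 'gcTimeout' field instead of raising KeyError.
import requests

def get_sleep_time(hf_token: str, space_id: str, default: int = 300) -> int:
    response = requests.get(
        f"https://huggingface.co/api/spaces/{space_id}/runtime",
        headers={"Authorization": f"Bearer {hf_token}"},
    )
    runtime = response.json().get("runtime", {})
    return runtime.get("gcTimeout") or default
```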

I tried again, deleting, duplicating, and refreshing the whole app once the hardware was ready.

I then get this training error message:

Unfortunately there was an error during training your niko-1-5 model.
Please check it out below. Feel free to report this issue to Dreambooth Training:
CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. 

And here is the detailed log:

Starting single training...
Namespace(Session_dir='', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adam_weight_decay=0.01, cache_latents=False, center_crop=False, class_data_dir=None, class_prompt='', dump_only_text_encoder=False, gradient_accumulation_steps=1, gradient_checkpointing=False, hub_model_id=None, hub_token=None, image_captions_filename=True, instance_data_dir='instance_images', instance_prompt='', learning_rate=2e-06, local_rank=-1, logging_dir='logs', lr_scheduler='polynomial', lr_warmup_steps=0, max_grad_norm=1.0, max_train_steps=2100, mixed_precision='fp16', num_class_images=100, num_train_epochs=1, output_dir='output_model', pretrained_model_name_or_path='/home/user/.cache/huggingface/hub/models--multimodalart--sd-fine-tunable/snapshots/9dabd4dbbdd4c72e2ffbc8fb4e28debef0254949', prior_loss_weight=1.0, push_to_hub=False, resolution=512, sample_batch_size=4, save_n_steps=0, save_starting_step=1, scale_lr=False, seed=42, stop_text_encoder_training=1470, tokenizer_name=None, train_batch_size=1, train_only_unet=False, train_text_encoder=True, use_8bit_adam=True, with_prior_preservation=False)
Enabling memory efficient attention with xformers...
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
 Niko   Niko  Adding Safety Checker to the model...
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 337, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1015, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 833, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/helpers.py", line 584, in tracked_fn
    response = fn(*args)
  File "app.py", line 340, in train
    push(model_name, where_to_upload, hf_token, which_model, True)
  File "app.py", line 360, in push
    convert("output_model", "model.ckpt")
  File "/home/user/app/convertosd.py", line 270, in convert
    unet_state_dict = torch.load(unet_path, map_location="cpu")
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/unet/diffusion_pytorch_model.bin'
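This last FileNotFoundError looks like a follow-on error: the conversion step expects output_model/unet/diffusion_pytorch_model.bin, which was presumably never written because training crashed before saving. A hypothetical guard around the call in app.py could surface the real failure instead (convert is the function from convertosd.py shown in the traceback):

```python
# Sketch: check that training actually produced the UNet weights before converting,
# so a failed training run reports the real problem instead of FileNotFoundError.
import os
from convertosd import convert  # as in the traceback's /home/user/app/convertosd.py

def safe_convert(output_dir: str = "output_model", ckpt_name: str = "model.ckpt") -> None:
    unet_path = os.path.join(output_dir, "unet", "diffusion_pytorch_model.bin")
    if not os.path.isfile(unet_path):
        raise RuntimeError(
            f"{unet_path} is missing; training likely failed before saving. "
            "Check the training logs above before converting/pushing."
        )
    convert(output_dir, ckpt_name)
```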
