Finetuning details/code

#12
by jerobich - opened

Thanks for this great experiment; it's thought-provoking!

I would be interested in pushing the limits a bit and seeing how we can produce "events" that last less than 10 seconds, with prompts like "groovy rock music with a 4-second sax reel". I'm also curious how we could deal with ambient sound events, like a dog barking or a bird singing. What do you suggest we do if we have shorter audio clips (like a 2-second bird song)? Should we just loop the sound to reach 10 seconds, or pad the audio with silence?
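For instance, I was imagining a preprocessing step along these lines (a numpy sketch; the function name and the 10-second target are just illustrative, not from the repo):

```python
import numpy as np

def fit_to_length(audio: np.ndarray, sr: int, target_sec: float = 10.0,
                  mode: str = "pad") -> np.ndarray:
    """Extend a short mono clip to target_sec by silence-padding or looping."""
    target_len = int(sr * target_sec)
    if len(audio) >= target_len:
        return audio[:target_len]
    if mode == "loop":
        # Tile the clip until it covers the target length, then trim.
        reps = int(np.ceil(target_len / len(audio)))
        return np.tile(audio, reps)[:target_len]
    # Default: pad the tail with silence.
    return np.pad(audio, (0, target_len - len(audio)))
```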

What if we're trying to capture something that exceeds 10 seconds, like a long fog horn? I understand the 10-second length is a choice you made to match the image size, but once we pick a limit, we're kind of stuck with it...
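For long sources, I suppose we could slice them into overlapping 10-second windows and caption each slice; a rough sketch (window and hop sizes are just illustrative):

```python
def chunk_audio(audio: np.ndarray, sr: int, win_sec: float = 10.0,
                hop_sec: float = 5.0):
    """Yield overlapping fixed-length windows from a long mono clip."""
    win, hop = int(sr * win_sec), int(sr * hop_sec)
    # A clip shorter than one window falls through as a single short chunk,
    # which could then be padded with the helper above.
    for start in range(0, max(len(audio) - win, 0) + 1, hop):
        yield audio[start:start + win]
```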

Finally, any chance you would share your fine-tuning setup? It would save us a lot of time in trying to push the envelope of what you've accomplished!

Thanks again!

I can get fine-tuning to run just using the diffusers examples/train_text_to_image.py script. You do need to format your data, but that's covered in the HF docs, so it's not a huge issue.
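For reference, the formatting is the imagefolder-style layout from the HF docs: a train directory of spectrogram images plus a metadata.jsonl mapping each file to its caption. Something like this (the paths and captions are placeholders):

```python
import json

# Placeholder captions for spectrogram PNGs already in ./train/.
captions = {
    "clip_000.png": "groovy rock music with a sax reel",
    "clip_001.png": "a bird singing over light rain",
}

with open("train/metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        # train_text_to_image.py reads the "text" column as the caption by default.
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
```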
In my case I had to download the riffusion model locally, since I was getting errors when trying to fetch it directly from HF (i.e., in code). So I used git to clone a local copy (you need git-lfs for the weights).
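If the in-code download fails for you too, huggingface_hub's snapshot_download is another way to pull a local copy (I'm assuming the repo id is riffusion/riffusion-model-v1):

```python
from huggingface_hub import snapshot_download

# Fetches the full repo (including the LFS weight files) into the local
# HF cache and returns the path, which can be passed to from_pretrained.
local_dir = snapshot_download("riffusion/riffusion-model-v1")
```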

The problem I'm having is that, when trying to run the model, I get a type error: RuntimeError: Input type (c10::Half) and bias type (float) should be the same
This is happening in F.conv2d(). Full trace:

Traceback (most recent call last):
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/home/james/src/somms/riffusion/riffusion/streamlit/pages/text_to_audio.py", line 102, in <module>
    render_text_to_audio()
  File "/home/james/src/somms/riffusion/riffusion/streamlit/pages/text_to_audio.py", line 78, in render_text_to_audio
    image = streamlit_util.run_txt2img(
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/caching/cache_utils.py", line 428, in wrapper
    return get_or_create_cached_value()
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/caching/cache_utils.py", line 401, in get_or_create_cached_value
    return_value = func(*args, **kwargs)
  File "/home/james/src/somms/riffusion/riffusion/streamlit/util.py", line 103, in run_txt2img
    output = pipeline(
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 531, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 421, in forward
    sample = self.conv_in(sample)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

Any thoughts as to what might be going on?

Ah, false alarm. I had hacked some stuff yesterday while trying to get things working... after removing my hacks, it works fine.
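For anyone who lands on this error without hacks to remove: it generally means half-precision inputs are hitting full-precision weights somewhere in the UNet. Loading the whole pipeline in a single dtype avoids it; a minimal sketch (the local path is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load every submodule in the same dtype so conv inputs and biases match.
pipe = StableDiffusionPipeline.from_pretrained(
    "./riffusion-model-v1",  # placeholder local path
    torch_dtype=torch.float16,
).to("cuda")
```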

@jbmaxwell I'm working on the same thing. What's a good way to connect with you so we can share what we learn along the way?
