Riffusion Fine-Tune

This is a fine-tuned version of Riffusion, trained on bass samples extracted from the NSynth dataset. The purpose of this work is to evaluate how well the model can generate bass audio samples.

Notes

  • This is the approach I found to achieve this goal. If you have a better idea, please share it with me.

Quickstart Guide

Clone the Riffusion repository and install the dependencies listed in its requirements.txt file: Riffusion GitHub

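import torch
from diffusers import DiffusionPipeline

# float16 weights are meant for GPU inference, so a CUDA device is assumed here
device = "cuda"

pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device)
prompt = "Your desired prompt"
# The pipeline returns PIL images; each one is a spectrogram
image = pipe(prompt).images[0]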

The generated spectrogram is now stored in image. To convert it into an audio file, you can use the SpectrogramImageConverter class from the spectrogram_image_converter module in the Riffusion repo.

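from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Default spectrogram parameters from the Riffusion repo
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
# Returns a pydub AudioSegment
audio = converter.audio_from_spectrogram_image(image)
# Write the result to disk (the file name is only an example)
audio.export("bass_sample.wav", format="wav")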
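As a rough sanity check on the converted audio, the clip length can be estimated from the image width. The sketch below assumes that each pixel column covers one spectrogram step and that the step size is 10 ms, which I believe is the SpectrogramParams default, so please verify it against your copy of the Riffusion repo. The helper name clip_duration_seconds is hypothetical and not part of Riffusion.

```python
def clip_duration_seconds(image_width_px: int, step_size_ms: int = 10) -> float:
    """Estimate audio length, assuming one pixel column per spectrogram step.

    step_size_ms=10 is an assumed default; check SpectrogramParams in the repo.
    """
    return image_width_px * step_size_ms / 1000.0


# A 512-pixel-wide image, matching the --resolution=512 used for this model
duration = clip_duration_seconds(512)
print(f"Estimated clip length: {duration:.2f} s")
```

If the estimate does not match the length of the exported audio, the spectrogram parameters probably differ from the ones assumed here.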

Fine Tuning

For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: NSynth Dataset

You can find the pre-processed files in my repo, here: DaveLoay/NSynth_Bass_Captions

As mentioned in the official Riffusion HF repo, I used the train_text_to_image script from the Diffusers repo, which you can check out here: Diffusers Repo

After configuring all dependencies, I used the following command to train the model:

  accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
    --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
    --dataset_name=DaveLoay/NSynth_Bass_Captions \
    --resolution=512 \
    --use_ema \
    --train_batch_size=3 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --max_train_steps=4000 \
    --learning_rate=1e-05 \
    --max_grad_norm=1 \
    --lr_scheduler="constant" --lr_warmup_steps=0 \
    --output_dir="Riffusion_FT_Bass_512_4000" \
    --push_to_hub
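To make the batch settings above easier to reason about, here is a small sketch of the arithmetic behind them. It assumes a single process (one GPU) and that each training step is one optimizer update after gradient accumulation. The function name training_budget is hypothetical.

```python
def training_budget(train_batch_size: int,
                    gradient_accumulation_steps: int,
                    max_train_steps: int,
                    num_processes: int = 1) -> tuple[int, int]:
    """Return (effective batch size, total samples processed).

    num_processes=1 is an assumption matching a single-GPU run.
    """
    effective_batch = train_batch_size * gradient_accumulation_steps * num_processes
    samples_seen = effective_batch * max_train_steps
    return effective_batch, samples_seen


# Values taken from the accelerate launch command
effective_batch, samples_seen = training_budget(
    train_batch_size=3,
    gradient_accumulation_steps=4,
    max_train_steps=4000,
)
print(f"Effective batch size: {effective_batch}")
print(f"Samples processed: {samples_seen}")
```

Dividing the number of processed samples by the dataset size gives the approximate number of epochs.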

Hardware

The hardware I used to fine-tune this model is:

  • NVIDIA A100 with 40 GB of VRAM, hosted in Google Colab Pro

Training took about 3 hours and used roughly 26 GB of VRAM.

Credits

You can check the original repositories here:

Riffusion

NSynth Dataset

Diffusers
