<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8" />
		<meta name="viewport" content="width=device-width" />
		<title>Riffusion-Melodiff-v1</title>
		<link rel="stylesheet" href="style.css" />
	</head>
	<body>
		<div class="card">
			<h1>Riffusion-Melodiff-v1</h1>
			<p><br> Riffusion-Melodiff is a simple but interesting idea (one I have not seen anywhere else) for creating cover versions of songs.</p>
            <p><br> Riffusion-Melodiff is built on top of the 
              <a href="https://huggingface.co/riffusion/riffusion-model-v1" target="_blank">Riffusion</a> 
              model, a Stable Diffusion model fine-tuned to generate Mel spectrograms. (A spectrogram is a
              visual representation of audio that shows how its frequency content changes over time.) Riffusion-Melodiff does not contain a new model; there was no new training or fine-tuning. 
              It uses the same model as Riffusion, only in a different way.</p>
            <p>Riffusion-Melodiff uses the Img2Img pipeline from the Diffusers library to modify images of Mel spectrograms and produce new versions of music. Just upload your audio 
              in WAV format (if your audio is in a different format, convert it to WAV first, for example with an online converter). Then you can run the Img2Img pipeline 
              with your prompt, seed and strength. The strength parameter decides how closely the modified audio follows the initial audio and how closely it follows the prompt. 
              When strength is too low, the spectrogram stays too similar to the original and we do not get a new modification. When strength is too high, the spectrogram follows 
              the new prompt too closely, which may cause loss of melody and/or tempo from the base image. Good values of strength are usually about 0.4 to 0.5.</p>
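            <p>The Img2Img call described above might look roughly like the sketch below. It is an illustration under stated assumptions, not necessarily the code from the notebook: the function names, 
              the default strength of 0.45, the float16 setting and the CUDA device are my own choices, and the conversion between audio and spectrogram images is left out. The small helper shows how, 
              as far as I know, current Diffusers versions turn strength into the number of denoising steps that are actually run.</p>
<pre>
```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps that img2img actually runs for a given strength.

    Mirrors the step calculation used by the Diffusers img2img pipeline
    (to my knowledge); a higher strength means more steps and more change.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


def modify_spectrogram(init_image, prompt: str, seed: int, strength: float = 0.45):
    """Run img2img on a spectrogram image (a PIL image) and return the new image.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Heavy imports stay inside the function so the helper above can be
    # used without torch or diffusers installed.
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline

    # Assumption: a CUDA GPU is available (e.g. a Colab GPU runtime).
    # In real use, load the pipeline once and reuse it for every clip.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16
    ).to("cuda")

    # A fixed seed makes the result reproducible for the same inputs.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(prompt=prompt, image=init_image, strength=strength, generator=generator)
    return result.images[0]
```
</pre>
            <p>For example, <code>effective_steps(50, 0.5)</code> gives 25: with a strength of 0.5, only half of the scheduled denoising steps are applied to the original spectrogram.</p>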
            <p>Good modifications are possible with suitable prompt, seed and strength values. They keep the tempo and melody of the initial audio but 
              change, e.g., the instrument playing that melody. This pipeline also makes modifications longer than 5 s possible: if you cut your audio into 5 s pieces 
              and use the same prompt, seed and strength for each piece, the generated samples will be somewhat consistent, so concatenating them gives a 
              longer modified audio.</p>
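            <p>The cut, modify and concatenate idea can be sketched in plain Python as follows. This is only a minimal illustration: audio is treated as a flat list of samples, and 
              <code>modify_chunk</code> is a hypothetical callback standing in for the whole spectrogram and Img2Img step, called with the same prompt, seed and strength for every piece.</p>
<pre>
```python
from typing import Callable, List, Sequence

CHUNK_SECONDS = 5  # clip length used in the text


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Cut the audio into consecutive pieces of chunk_seconds each.

    The last piece may be shorter if the audio length is not a multiple
    of the chunk length.
    """
    chunk_len = sample_rate * chunk_seconds
    return [list(samples[i:i + chunk_len]) for i in range(0, len(samples), chunk_len)]


def modify_long_audio(samples: Sequence[float], sample_rate: int,
                      modify_chunk: Callable[[List[float]], List[float]]) -> List[float]:
    """Apply the same modification to every piece and join the results."""
    output: List[float] = []
    for chunk in split_into_chunks(samples, sample_rate):
        # modify_chunk should use identical prompt, seed and strength each
        # time, so that neighbouring pieces stay reasonably consistent.
        output.extend(modify_chunk(chunk))
    return output
```
</pre>
            <p>Passing an identity function as <code>modify_chunk</code> returns the original samples unchanged, which is a quick way to check the splitting and joining before plugging in the real pipeline.</p>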
            <p>The quality of the generated music is not amazing (mediocre, I would say), and it needs a bit of prompt and seed engineering. But it shows one way cover 
              versions of music could be made in the future.</p>
			<p>
				A Colab notebook is included that shows how to do it step by step:
				<a href="https://huggingface.co/spaces/JanBabela/Riffusion-Melodiff-v1/blob/main/melodiff_v1.ipynb" target="_blank">Melodiff_v1</a>.
			</p>
          <p> <br> Examples of music generated by modifying the underlying song: <br> </p>
          <p>
            Amazing Grace, originally played by flute, modified to be played by violin
            <audio controls>
              <source src="Amazing_Grace_flute_i2i_violin.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p>
            Bella Ciao, originally played by violin, modified to be played by saxophone
            <audio controls>
              <source src="Bella_Cao_violin_i2i_sax.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p>
            Iko iko, originally played by accordion, modified to be played by saxophone
            <audio controls>
              <source src="Iko_iko_accordion_i2i_sax.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p>
            When the Saints, originally played by violin, modified to be sung by vocals
            <audio controls>
              <source src="When_the_Saints_violin_i2i_vocals.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p> <br> Examples of longer music samples: <br> </p>
          <p>
            Iko iko, originally played by accordion, modified to be played by saxophone
            <audio controls>
              <source src="Iko_iko_long_accordion_i2i_sax.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p>
            Iko iko, originally played by accordion, modified to be played by violin
            <audio controls>
              <source src="Iko_iko_long_sax_i2i_violin.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p>
          <p>
            When the Saints, originally played by piano, modified to be played by flute
            <audio controls>
              <source src="When_the_Saints_long_piano_i2i_flute.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
          </p> 
          <p> <br> I'm using the standard (free) Google Colab GPU configuration for inference, with the default number of inference steps (23) from the underlying 
            pipelines. With this setup it takes about 8 s to produce a 5 s modified sample. For a start that is OK, I would say.</p>
        </div>
	</body>
</html>