schnik committed on
Commit
8f9d4fd
1 Parent(s): 3c3b47c

update README with explanation and Gradio interface with examples

.gitattributes CHANGED
@@ -449,3 +449,4 @@ inference_large/peft_symmv_large_s_8_video(1).mp4 filter=lfs diff=lfs merge=lfs
  inference_large/peft_symmv_large_s_8_video(2).mp4 filter=lfs diff=lfs merge=lfs -text
  inference_large/peft_symmv_large_s_8_video(3).mp4 filter=lfs diff=lfs merge=lfs -text
  inference_large/peft_symmv_large_s_8_video(4).mp4 filter=lfs diff=lfs merge=lfs -text
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
+ venv/
+ *.pyc
README.md CHANGED
@@ -8,7 +8,18 @@ language:
  ### Abstract
  Current AI music generation models are mainly controlled with a single input modality: text. Adapting these models to accept alternative input modalities extends their field of use. Video input is one such modality, with remarkably different requirements for the generation of background music accompanying it. Even though alternative methods for generating video background music exist, none achieve the music quality and diversity of the text-based models. Hence, this thesis aims to efficiently reuse text-based models' high-fidelity music generation capabilities by adapting them for video background music generation. This is accomplished by training a model to represent video information inside a format that the text-based model can naturally process. To test the capabilities of our approach, we apply two datasets for model training with various levels of variation in the visual and audio parts. We evaluate our approach by analyzing the audio quality and diversity of the results. A case study is also performed to determine the video encoder's ability to capture the video-audio relationship successfully.
 
- This repository contains the code for the pretrained models for the adaptation of MusicGen([https://arxiv.org/abs/2306.05284](https://arxiv.org/abs/2306.05284)) to video background music generation. The full code is available at [https://git.rwth-aachen.de/i5/master-thesis-niklas-schulte](https://git.rwth-aachen.de/i5/master-thesis-niklas-schulte)
+ This repository contains the code and pretrained models for the adaptation of MusicGen ([https://arxiv.org/abs/2306.05284](https://arxiv.org/abs/2306.05284)) to video background music generation. A Gradio interface is provided for convenient usage of the models.
 
- # Contact
+ ### Installation
+ - install PyTorch `2.1.0` with CUDA enabled by following the instructions at [https://pytorch.org/get-started/previous-versions/](https://pytorch.org/get-started/previous-versions/)
+ - install the local fork of audiocraft with `pip install git+https://github.com/IntelliNik/audiocraft.git@main`
+ - install the remaining dependencies with `pip install peft moviepy omegaconf`
+
+ ### Usage
+ - start the Gradio interface with `python app.py`
+ - select an example input video or upload one through the interface
+ - select the parameters ("nature", "peft=true", and "large" give results with the highest audio quality)
+ - start the generation by clicking "Submit"
+
+ ### Contact
  For any questions contact me at [niklas.schulte@rwth-aachen.de](mailto:niklas.schulte@rwth-aachen.de)
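The installation steps above can be sanity-checked from Python before launching the app. A minimal sketch; the module names are assumptions inferred from the pip commands above, and the helper itself is not part of the repository:

```python
import importlib.util

# Module names assumed from the Installation section's pip commands.
REQUIRED = ("torch", "audiocraft", "peft", "moviepy", "omegaconf")

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found on sys.path."""
    return [n for n in names if importlib.util.find_spec(n) is None]
```

Running `missing_packages()` before `python app.py` fails faster and more readably than an import error in the middle of a generation run.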
app.py CHANGED
@@ -52,7 +52,7 @@ interface = gr.Interface(fn=generate_background_music,
  outputs=[gr.Video(label="video output")],
  examples=[
- [os.path.abspath("../../../videos/originals/n_1.mp4"), "nature", True, "small"],
+ [os.path.abspath("./videos/originals/n_1.mp4"), "nature", True, "small"],
  [os.path.abspath("../../../videos/originals/n_2.mp4"), "nature", True, "small"],
  [os.path.abspath("../../../videos/originals/n_3.mp4"), "nature", True, "small"],
  [os.path.abspath("../../../videos/originals/n_4.mp4"), "nature", True, "small"],
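The changed row switches the n_1.mp4 example to a path relative to the repository root. A hypothetical helper sketching how the whole `examples` list could be built uniformly once the remaining rows use the same layout; the function name and defaults are assumptions, not part of app.py:

```python
import os

def example_rows(count, dataset="nature", use_peft=True, size="small"):
    """Build Gradio `examples` rows for the bundled nature clips,
    using the repo-relative layout introduced in this commit
    (./videos/originals/n_<i>.mp4)."""
    return [
        [os.path.abspath(f"./videos/originals/n_{i}.mp4"), dataset, use_peft, size]
        for i in range(1, count + 1)
    ]
```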
inference.py CHANGED
@@ -2,7 +2,7 @@ from omegaconf import OmegaConf
  from peft import PeftConfig, get_peft_model
  from audiocraft.models import MusicGen
  from moviepy.editor import AudioFileClip
- from code.inference.inference_utils import *
+ from inference_utils import *
  import re
  import time
 
@@ -19,8 +19,7 @@ def generate_background_music(video_path: str,
  musicgen_guidance_scale: float = 3.0,
  top_k_sampling: int = 250) -> str:
  start = time.time()
- model_path = "../training/"
- model_path += "models_peft" if use_peft else "models_audiocraft"
+ model_path = "models_peft" if use_peft else "models_audiocraft"
  model_path += f"/{dataset}" + f"_{musicgen_size}"
 
  conf = OmegaConf.load(model_path + '/configuration.yml')
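After this commit the checkpoint directory is resolved relative to the repository root instead of `../training`. A standalone sketch of that path logic; the function name is hypothetical, while the string construction mirrors the diff:

```python
def resolve_model_path(use_peft: bool, dataset: str, musicgen_size: str) -> str:
    """Rebuild the model_path string exactly as inference.py does post-commit."""
    model_path = "models_peft" if use_peft else "models_audiocraft"
    model_path += f"/{dataset}" + f"_{musicgen_size}"
    return model_path
```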
inference_utils.py CHANGED
@@ -146,9 +146,9 @@ class PositionalEncoding(nn.Module):
  def __init__(self, d_model: int, dropout: float = 0.1, max_length: int = 5000):
  super().__init__()
  self.dropout = nn.Dropout(p=dropout)
- position = torch.arange(30).unsqueeze(1)
+ position = torch.arange(max_length).unsqueeze(1)
  div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
- pe = torch.zeros(30, 1, d_model)
+ pe = torch.zeros(max_length, 1, d_model)
  pe[:, 0, 0::2] = torch.sin(position * div_term)
  pe[:, 0, 1::2] = torch.cos(position * div_term)
  self.register_buffer('pe', pe)
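The fix above replaces a hardcoded table length of 30 with the `max_length` constructor argument, so the precomputed buffer matches the declared capacity and longer sequences no longer index past it. A dependency-free sketch of the same sinusoidal table, as a pure-Python stand-in for the torch code in inference_utils.py:

```python
import math

def positional_encoding(max_length: int, d_model: int) -> list:
    """Sinusoidal positional-encoding table of shape (max_length, d_model).

    Even channels get sin, odd channels cos, with the same frequency spacing
    as div_term = exp(arange(0, d_model, 2) * (-log(10000) / d_model)) above.
    """
    pe = [[0.0] * d_model for _ in range(max_length)]
    for pos in range(max_length):
        for i in range(0, d_model, 2):
            angle = pos * math.exp(-i * math.log(10000.0) / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```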
videos/originals/n_1.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92f51ce1c5d412305ff6c69fe9dbb40165cb2c5c96f9cc992f267469fdb958d4
+ size 4804928
videos/originals/n_2.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44a3a6916f666ddf6341c8175c4f2c87f08374f4e83fe25ddf94545024d3a4c4
+ size 4158436
videos/originals/n_3.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35ea7c138471d40444d9f7b6fad59c8208645ed5b59552a9ea9aea2903ae9601
+ size 8422094
videos/originals/n_4.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc01b9e9f1d2e3714450d1e22773c605137502a12a80ae2e218f790c85ff5322
+ size 9284034
videos/originals/n_5.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:59e290b1cfc42c821e317f7ee95bc2c433a24859306606a0fb836b3f9e21a61a
+ size 5170579
videos/originals/n_6.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f224455ea6cf80139b4a1c99116f086f928a8b7a70b71958963647e51eaa36dd
+ size 9048361
videos/originals/n_7.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f731e0efe8184ee17565c57ab61e94a04baed1e57aa802e91a8260efabd22f69
+ size 5035622
videos/originals/n_8.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:38287ed66edb44bca60d2bc953739f8362046f3937f86cdc7cb26ca32d733a9e
+ size 10856903
videos/originals/s_1.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b07f5949d7e01bb900b6a0073f945695e0a1c80dd798b70e883f609d21c3bf20
+ size 5537767
videos/originals/s_2.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a7c9bde94ff6b6c969032b8c8a7732689ce95f829a79643dc1396830c674ada
+ size 4036978
videos/originals/s_3.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:06f4bf34620e7a93dfd6f410653aee5b7506b29902929a5d5a06cef7cdc3ab16
+ size 3262708
videos/originals/s_4.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10b9374b2be0ef11532beacbd5dd1b9e98a29a94524ae91c825cd116ee69de34
+ size 4561038
videos/originals/s_5.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:958f104dadb5d6ebd0c1bf0cd8c70ffcaca816cb40b60f7a726fd40658d9726f
+ size 4942274
videos/originals/s_6.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a17ecfb295f0e2e14b81d89593bf455cc6f00b44b38b4d9889abc325ba6a71c
+ size 2624335
videos/originals/s_7.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc544f6854f21a5dd2ed1523df709040563ebc28acfb9b198db08e2576443007
+ size 13242170
videos/originals/s_8.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb4778e547e91442263a8d069e97a338672c562d1b2eeb4c910c06600935c0a6
+ size 1356501