MusicGen-Style: Audio Conditioning for Music Generation via Discrete Bottleneck Features

AudioCraft provides the code and models for MusicGen-Style, Audio Conditioning for Music Generation via Discrete Bottleneck Features.

MusicGen-Style is a text-and-audio-to-music model that can be conditioned on textual and audio data (thanks to a style conditioner). The style conditioner takes as input a music excerpt of a few seconds (between 1.5 and 4.5) extracts some features that are used by the model to generate music in the same style. This style conditioning can be mixed with textual description.

Check out our [sample page][musicgen_style_samples] or test the available demo!

We use 16K hours of licensed music to train MusicGen-Style. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.

Model Card

See the model card.

Installation

Please follow the AudioCraft installation instructions from the README.

MusicGen-Stem requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).

Usage

You can play with MusicGen-Style by running the jupyter notebook at demos/musicgen_style_demo.ipynb locally (if you have a GPU).
You can use the gradio demo locally by running python -m demos.musicgen_style_app --share.
You can play with MusicGen by running the jupyter notebook at demos/musicgen_style_demo.ipynb locally (if you have a GPU).

API

We provide a simple API 1 pre-trained model with MERT used as a feature extractor for the style conditioner:

facebook/musicgen-style: medium (1.5B) MusicGen model, text and style to music, generates 30-second samples - 🤗 Hub

In order to use MusicGen-Style locally you must have a GPU. We recommend 16GB of memory.

See after a quick example for using the API.

To perform text-to-music:

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')


model.set_generation_params(
    duration=8, # generate 8 seconds, can go up to 30
    use_sampling=True, 
    top_k=250,
    cfg_coef=3., # Classifier Free Guidance coefficient 
    cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning
)  

descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

To perform style-to-music:

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')


model.set_generation_params(
    duration=8, # generate 8 seconds, can go up to 30
    use_sampling=True, 
    top_k=250,
    cfg_coef=3., # Classifier Free Guidance coefficient 
    cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning
)

model.set_style_conditioner_params(
    eval_q=1, # integer between 1 and 6
              # eval_q is the level of quantization that passes
              # through the conditioner. When low, the models adheres less to the 
              # audio conditioning
    excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt
    )

melody, sr = torchaudio.load('./assets/electronic.mp3')


wav = model.generate_with_chroma(descriptions=[None, None, None], 
                melody[None].expand(3, -1, -1), sr)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

To perform style-and-text-to-music:

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')


model.set_generation_params(
    duration=8, # generate 8 seconds, can go up to 30
    use_sampling=True, 
    top_k=250,
    cfg_coef=3., # Classifier Free Guidance coefficient 
    cfg_coef_beta=5., # double CFG is necessary for text-and-style conditioning
                   # Beta in the double CFG formula. between 1 and 9. When set to 1 it is equivalent to normal CFG. 
                   # When we increase this parameter, the text condition is pushed. See the bottom of https://musicgenstyle.github.io/ 
                   # to better understand the effects of the double CFG coefficients. 
)

model.set_style_conditioner_params(
    eval_q=1, # integer between 1 and 6
              # eval_q is the level of quantization that passes
              # through the conditioner. When low, the models adheres less to the 
              # audio conditioning
    excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt, can be                 
                       # between 1.5 and 4.5 seconds but it has to be shortest to the length of the provided conditioning
    )

melody, sr = torchaudio.load('./assets/electronic.mp3')

descriptions = ["8-bit old video game music", "Chill lofi remix", "80s New wave with synthesizer"]
wav = model.generate_with_chroma(descriptions=["8-bit old video game music"], 
                melody[None].expand(3, -1, -1), sr)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

Training

To train MusicGen-Style, we use the MusicGenSolver.

Note that we do NOT provide any of the datasets used for training MusicGen-Style. We provide a dummy dataset containing just a few examples for illustrative purposes.

Please read first the TRAINING documentation, in particular the Environment Setup section.

Example configurations and grids

We provide the configuration to reproduce the training of MusicGen-Style in config/solver/musicgen/musicgen_style_32khz.yaml,

In particular, the conditioner configuration is provided in /config/conditioner/style2music.yaml.

The grid to train the model is audiocraft/grids/musicgen/musicgen_style_32khz.py.

# text-and-style-to-music
dora grid musicgen.musicgen_style_32khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.

dataset and metadata

Learn more in the datasets section.

Audio tokenizers

See MusicGen

Fine tuning existing models

You can initialize your model to one of the pretrained models by using the continue_from argument, in particular

# Using pretrained MusicGen-Style model.
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/musicgen-style conditioner=style2music

# Using another model you already trained with a Dora signature SIG.
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=style2music

# Or providing manually a path
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th

Warning: You are responsible for selecting the other parameters accordingly, in a way that make it compatible with the model you are fine tuning. Configuration is NOT automatically inherited from the model you continue from. In particular make sure to select the proper conditioner and model/lm/model_scale.

Warning: We currently do not support fine tuning a model with slightly different layers. If you decide to change some parts, like the conditioning or some other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run load_state_dict. If you decide to do so, make sure your checkpoint is saved with torch.save and contains a dict {'best_state': {'model': model_state_dict_here}}. Directly give the path to continue_from without a //pretrained/ prefix.