---
pipeline_tag: text-to-audio
library_name: audiocraft
language: en
tags:
  - text-to-audio
  - musicgen
  - songstarter
license: cc-by-nc-4.0
---

# Model Card for musicgen-songstarter-v0.2

*Demos: Replicate (cloud API) · Colab notebook · Hugging Face Space*

**musicgen-songstarter-v0.2** is a fine-tuned version of musicgen-stereo-melody-large, trained on a dataset of melody loops from my Splice sample library. It's intended to generate song ideas that are useful to music producers. It outputs stereo audio at 32kHz.

👀 Update: I wrote a blog post detailing how and why I trained this model, including training details, the dataset, Weights & Biases logs, etc.

Compared to musicgen-songstarter-v0.1, this new version:

- was trained on 3x more unique, manually curated samples that I painstakingly purchased on Splice
- is twice the size, bumped up from a medium ➡️ large transformer LM

If you find this model interesting, please consider giving it a like here on the Hub.

## Usage

Install audiocraft:

```sh
pip install -U git+https://github.com/facebookresearch/audiocraft#egg=audiocraft
```
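
To verify the install, you can import the package and print its version (a quick sanity check; the exact version string depends on the commit you installed):

```python
# Quick sanity check that audiocraft is importable.
import audiocraft
print(audiocraft.__version__)
```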

Then, you should be able to load this model just like any other musicgen checkpoint here on the Hub:

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('nateraw/musicgen-songstarter-v0.2')
model.set_generation_params(duration=8)  # generate 8 seconds of audio
wav = model.generate_unconditional(4)    # generates 4 unconditional audio samples
descriptions = ['acoustic, guitar, melody, trap, d minor, 90 bpm'] * 3
wav = model.generate(descriptions)       # generates 3 text-conditioned samples

# Generate using the melody from the given audio plus the text descriptions.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Saves under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
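
Generation can be tuned beyond `duration`. The sketch below uses `set_generation_params` knobs from upstream audiocraft; the values shown are illustrative, not recommendations specific to this model:

```python
# Optional: adjust sampling behavior before calling generate().
model.set_generation_params(
    duration=15,        # length of generated audio, in seconds
    use_sampling=True,  # sample tokens rather than greedy decoding
    top_k=250,          # restrict sampling to the 250 most likely tokens
    temperature=1.0,    # higher = more random output
    cfg_coef=3.0,       # classifier-free guidance strength (text adherence)
)
```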

## Prompt Format

Use the following prompt format:

```
{tag_1}, {tag_2}, ..., {tag_n}, {key}, {bpm} bpm
```

For example:

```
hip hop, soul, piano, chords, jazz, neo jazz, G# minor, 140 bpm
```

For example tags, see the prompt format section of musicgen-songstarter-v0.1's README. The tags there come from the smaller v0.1 dataset, but they should give you an idea of what the model saw during training.
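
If you build prompts programmatically, a small helper like the one below (hypothetical, not part of audiocraft or this repo) keeps them in the expected shape:

```python
# Hypothetical helper: assemble a prompt string in the expected format.
def make_prompt(tags: list[str], key: str, bpm: int) -> str:
    return ", ".join([*tags, key, f"{bpm} bpm"])

print(make_prompt(["hip hop", "soul", "piano", "chords"], "G# minor", 140))
# hip hop, soul, piano, chords, G# minor, 140 bpm
```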

## Samples

| Audio Prompt | Text Prompt | Output |
|---|---|---|
| *(audio)* | trap, synthesizer, songstarters, dark, G# minor, 140 bpm | *(audio)* |
| *(audio)* | acoustic, guitar, melody, trap, D minor, 90 bpm | *(audio)* |

## Training Details

For more verbose details, you can check out the blogpost.

- **code:**
  - Repo is here. It's an undocumented fork of facebookresearch/audiocraft in which I rewrote the training loop with PyTorch Lightning, which worked a bit better for me.
- **data:**
  - Around 1,700–1,800 samples that I manually auditioned and purchased via my personal Splice account, totaling about 7–8 hours of audio.
  - Given the licensing terms, I cannot share the data.
- **hardware:**
  - An 8x A100 (40GB) instance from Lambda Labs.
- **procedure:**
  - Trained for 10k steps, which took about 6 hours.
  - Reduced the segment duration at train time to 15 seconds.
- **hparams/logs:**
  - See the wandb run, which includes training metrics, logs, hardware metrics at train time, hyperparameters, and the exact command I used to run the training script.

## Acknowledgements

This work would not have been possible without:

- Lambda Labs, for subsidizing larger training runs by providing some compute credits
- Replicate, for compute resources during early development

Thank you ❤️