Using 🤗 to Train a GPT-2 Model for Music Generation

Community Article Published October 5, 2023

In this tutorial, I'll walk you through the steps to create a Space similar to this one:

Try it now!

🤗 offers a comprehensive set of tools, from dataset creation to model demo deployment. You'll utilize these tools throughout this tutorial. Therefore, familiarity with the Hugging Face ecosystem will be beneficial. By the end of this tutorial, you will be able to train a GPT-2 model for music generation.

This tutorial is inspired by and builds upon the outstanding work of Dr. Tristan Behrens.

Overview

Generative AI is currently trending in the machine learning field. Impressive models such as ChatGPT or Stable Diffusion have captivated the tech community and the general public with their remarkable capabilities. Major companies like Facebook, OpenAI, and Stability AI have also ventured into this movement by releasing impressive music-generation tools.

There are usually two common approaches for generative music models. You can think of them in the following terms:

  • Raw audio: In this approach, you use the raw audio representation (.wav, .mp3) to train the model. Stable Audio and MusicGen use this method.
  • Symbolic music: Rather than using the raw audio representation, you can leverage the instructions that generate that audio. For instance, instead of using the recording of a flute melody, you'd use the score read by the musician to play the tune. MIDI or MusicXML files store the instructions needed to produce a specific piece of music. OpenAI trained MuseNet (no longer available) with symbolic music.

The focus of this tutorial is on symbolic models. Specifically, you'll implement a clever idea: If you can convert the instructions included in symbolic music files (MIDI files for this tutorial) into words, you could leverage the tremendous advancements in NLP to train your model!

Let's dive in together.

Throughout the tutorial, you will find a series of notebooks so you can examine the code for each part.

Collecting the Dataset and Converting It to Words

Note: Given the extensive size of the MIDI files required, I've curated a ready-to-use dataset available on Hugging Face. Alternatively, if you'd prefer a smaller dataset, you can utilize the JS Fake Chorales dataset to follow this tutorial.

Collecting the dataset and having it ready for training is the hardest part of the project. Fortunately, people have shared some MIDI collections on the Internet that you can use. You will use one of these collections, curated by Colin Raffel: the Lakh MIDI Dataset (LMD), which includes 176,581 unique MIDI files. From the LMD, you will use the Clean MIDI subset (14,751 files), whose filenames indicate the artist and title.

Getting the Genres

Knowing the artist and title of each file enables you to determine the song's genre. There are many ways to do this. I used a mixed method: first, I used the Spotify API to get the genres based on the artist, and then ChatGPT to group them into a final set of more or less balanced genres.

# Spotify API's code snippet
# Assumes `sp` is an authenticated spotipy.Spotify client and `artists` is the list of artist names
genres = {}
for i, artist in enumerate(artists):
    try:
        results = sp.search(q=artist, type='artist', limit=1)
        items = results['artists']['items']
        genre_list = items[0]['genres'] if items else []
        # Keep the first genre; artists with no results or no genres fall into the except branch
        genres[artist] = genre_list[0].replace(" ", "_")
        if i < 5:
            print("INFO: Preview {}/5".format(i + 1),
                  artist, genre_list[:5])
    except Exception as e:
        genres[artist] = "MISC"
        print("INFO: ", artist, "not included: ", e)

The results are far from perfect, but they are close enough to work for controlling our model. The final CSV file with the genres is available on GitHub.

Why should you get the genres? You could use the genres to incorporate a token

"GENRE={NAME_OF_GENRE}"

into the input sequence, steering the generation toward that specific genre, as you will see next.
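For example, at inference time you would simply start the prompt with the genre token you want. A hypothetical prompt, using the pseudo-words introduced in the next section:

# Hypothetical conditioning prompt: the GENRE token steers generation toward that style
prompt = "PIECE_START GENRE=POP TRACK_START"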

Tokenizing the Dataset

Fig: Tokenization of musical notes into text-based tokens.

The image above shows you one way to convert music instructions into tokens: Exactly what you want to train a language model! In this section, you'll discover how to transition from a MIDI file to a text-based format using pseudo-words (terms that aren't part of the English vocabulary) for training your GPT-2 model.

Chunking the Dataset

In this tutorial, you will chunk each file into 8-bar windows, where each 'bar' is a segment containing a specified number of beats. Play with other numbers, such as 4 or 16, to see how the result changes. There are many ways to do this, but for simplicity, let's loop over the dataset and create new MIDI files that are 8 bars long. I used the following code in Colab for the chunking.

# Imports needed by this snippet; midi_paths, merged_out_dir, dataset,
# MAX_NB_BAR (8) and MIN_NB_NOTES are defined earlier in the notebook
from copy import deepcopy
from math import ceil
from pathlib import Path

from miditoolkit import MidiFile
from tqdm import tqdm

for i, midi_path in enumerate(tqdm(midi_paths, desc="CHUNKING MIDIS")):
    try:
        # Determine the output directory for this file
        relative_path = midi_path.relative_to(Path("path/to/dataset/lmd", dataset))
        output_dir = merged_out_dir / relative_path.parent
        output_dir.mkdir(parents=True, exist_ok=True)

        # Check if chunks already exist
        chunk_paths = list(output_dir.glob(f"{midi_path.stem}_*.mid"))
        if len(chunk_paths) > 0:
            print(f"Chunks for {midi_path} already exist, skipping...")
            continue

        # Loads MIDI, merges and saves it
        midi = MidiFile(midi_path)
        ticks_per_cut = MAX_NB_BAR * midi.ticks_per_beat * 4  # 4 beats per bar, assuming 4/4 time
        nb_cuts = ceil(midi.max_tick / ticks_per_cut)
        if nb_cuts < 2:
            continue
        print(f"Processing {midi_path}")
        midis = [deepcopy(midi) for _ in range(nb_cuts)]

        for j, track in enumerate(midi.instruments):  # sort notes as they are not always sorted right
            track.notes.sort(key=lambda x: x.start)
            for midi_short in midis:  # clears notes from shorten MIDIs
                midi_short.instruments[j].notes = []
            for note in track.notes:
                cut_id = note.start // ticks_per_cut
                note_copy = deepcopy(note)
                note_copy.start -= cut_id * ticks_per_cut
                note_copy.end -= cut_id * ticks_per_cut
                midis[cut_id].instruments[j].notes.append(note_copy)

        # Saving MIDIs
        for j, midi_short in enumerate(midis):
            if sum(len(track.notes) for track in midi_short.instruments) < MIN_NB_NOTES:
                continue
            midi_short.dump(output_dir / f"{midi_path.stem}_{j}.mid")

    except Exception as e:
        print(f"An error occurred while processing {midi_path}: {e}")

This is a simplified version of the code; you can take a look at the complete notebook.

From MIDI Instructions to Words

Having segmented each song into 8-bar MIDI files, you're now ready to transform these files into pseudo-words. Researchers have proposed different music tokenization methods; you can find an excellent overview of the most popular ones in the docs of MidiTok, a powerful Python package to tokenize MIDI music files.

Compatibility table of tokenizations and additional tokens (the full matrix is in the MidiTok docs): the tokenizations covered are MIDILike, REMI, TSD, Structured, CPWord, Octuple, MuMIDI, and MMM, and the additional tokens are Tempo, Time signature, Chord, Rest, Sustain pedal, and Pitch bend.

Source: MidiTok, Python package to tokenize MIDI music files, presented at the ISMIR 2021 LBD.

You will use the MMM: Multi-Track Music Machine tokenization method for this tutorial. MMM is a simple yet powerful approach to convert MIDI files to pseudo-words. Try other tokenizers and compare the results. Please let me know which is your favorite tokenizer 😀.

MMM: Multi-Track Music Machine

Jeff Ens and Philippe Pasquier presented the MMM tokenizer in the paper MMM: Exploring Conditional Multi-Track Music Generation with the Transformer. Look at the following illustration from the paper to have a better understanding of this method:

MMM: Multi-Track Music Machine Tokenization Method
"Fig: The MultiTrack and Bar Fill representations are shown. The bar tokens correspond to complete bars, and the track tokens correspond to complete tracks."

In MMM, the numbers represent the pitch of the notes and the instruments in MIDI notation. For example, in the diagram above, NOTE_ON=60 is C4 (middle C), and INST=30 is an Overdriven Guitar. You use NOTE_ON/NOTE_OFF to indicate when a note starts and stops sounding and TIME_DELTA to move the timeline forward. The notes are wrapped inside <BAR_START> and <BAR_END> tokens, which are added inside <TRACK_START> and <TRACK_END> pseudo-words that you finally group inside <PIECE_START> and <PIECE_END>: Multi-Track Music Machine!

Let's illustrate this with a specific example taken from JS Fake Chorales:

PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=68 TIME_DELTA=4 NOTE_OFF=68 NOTE_ON=67 TIME_DELTA=4 NOTE_OFF=67 NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 NOTE_ON=63 TIME_DELTA=2 NOTE_OFF=63 NOTE_ON=65 TIME_DELTA=2 NOTE_OFF=65 BAR_END BAR_START NOTE_ON=67 TIME_DELTA=4 NOTE_OFF=67 NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 NOTE_ON=58 TIME_DELTA=2 NOTE_OFF=58 NOTE_ON=60 TIME_DELTA=2 NOTE_OFF=60 NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62 BAR_END BAR_START NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62 NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 BAR_END BAR_START NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 NOTE_ON=63 TIME_DELTA=12 NOTE_OFF=63 BAR_END TRACK_END TRACK_START INST=0 BAR_START NOTE_ON=72 TIME_DELTA=4 NOTE_OFF=72 NOTE_ON=75 TIME_DELTA=4 NOTE_OFF=75 NOTE_ON=70 TIME_DELTA=4 NOTE_OFF=70 NOTE_ON=67 TIME_DELTA=4 NOTE_OFF=67 BAR_END BAR_START NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=70 TIME_DELTA=2 NOTE_OFF=70 NOTE_ON=68 TIME_DELTA=4 NOTE_OFF=68 NOTE_ON=67 TIME_DELTA=4 NOTE_OFF=67 NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 BAR_END BAR_START NOTE_ON=70 TIME_DELTA=4 NOTE_OFF=70 NOTE_ON=68 TIME_DELTA=4 NOTE_OFF=68 NOTE_ON=67 TIME_DELTA=4 NOTE_OFF=67 NOTE_ON=72 TIME_DELTA=4 NOTE_OFF=72 BAR_END BAR_START NOTE_ON=72 TIME_DELTA=4 NOTE_OFF=72 NOTE_ON=70 TIME_DELTA=12 NOTE_OFF=70 BAR_END TRACK_END TRACK_START INST=32 BAR_START NOTE_ON=53 TIME_DELTA=4 NOTE_OFF=53 NOTE_ON=48 TIME_DELTA=4 NOTE_OFF=48 NOTE_ON=50 TIME_DELTA=4 NOTE_OFF=50 NOTE_ON=51 TIME_DELTA=4 NOTE_OFF=51 BAR_END BAR_START NOTE_ON=48 TIME_DELTA=4 NOTE_OFF=48 NOTE_ON=53 TIME_DELTA=4 NOTE_OFF=53 NOTE_ON=55 TIME_DELTA=2 NOTE_OFF=55 NOTE_ON=57 TIME_DELTA=2 NOTE_OFF=57 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 BAR_END BAR_START NOTE_ON=55 TIME_DELTA=4 NOTE_OFF=55 NOTE_ON=48 TIME_DELTA=2 NOTE_OFF=48 NOTE_ON=50 TIME_DELTA=2 NOTE_OFF=50 NOTE_ON=51 TIME_DELTA=4 NOTE_OFF=51 NOTE_ON=44 TIME_DELTA=2 NOTE_OFF=44 NOTE_ON=46 TIME_DELTA=2 NOTE_OFF=46 BAR_END BAR_START NOTE_ON=48 TIME_DELTA=2 NOTE_OFF=48 NOTE_ON=50 TIME_DELTA=2 NOTE_OFF=50 NOTE_ON=51 TIME_DELTA=12 NOTE_OFF=51 BAR_END TRACK_END TRACK_START INST=24 BAR_START NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 NOTE_ON=65 TIME_DELTA=2 NOTE_OFF=65 NOTE_ON=58 TIME_DELTA=2 NOTE_OFF=58 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 BAR_END BAR_START NOTE_ON=63 TIME_DELTA=2 NOTE_OFF=63 NOTE_ON=62 TIME_DELTA=2 NOTE_OFF=62 NOTE_ON=60 TIME_DELTA=2 NOTE_OFF=60 NOTE_ON=62 TIME_DELTA=2 NOTE_OFF=62 NOTE_ON=63 TIME_DELTA=4 NOTE_OFF=63 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 BAR_END BAR_START NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 BAR_END BAR_START NOTE_ON=56 TIME_DELTA=4 NOTE_OFF=56 NOTE_ON=55 TIME_DELTA=12 NOTE_OFF=55 BAR_END TRACK_END PIECE_END

I hope this concise overview provides clarity on how MMM operates. Now, to the fun part! Let's take the LMD Clean and convert it to pseudo-words.

To tokenize the dataset, you can leverage open-source libraries like MidiTok (mentioned above) or Musicaiz. Both offer great features to customize your tokenization process. However, I decided to use the MMM-JSB repo as a starting point and adapt it to the Lakh Midi Dataset because I could have more control over the process. You can find the adapted repo here.

The adapted repo removes files with multiple time signatures or time signatures other than 4/4. It also adds a GENRE= token so you can control, at inference time, the genre you want your model to generate. Finally, I decided not to quantize the notes so the results sound less robotic.
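For illustration, a check like the following (a sketch using miditoolkit, not the adapted repo's exact code) is enough to detect the files to keep:

from miditoolkit import MidiFile

def has_single_4_4_time_signature(midi_path) -> bool:
    """Return True if the MIDI file declares exactly one time signature and it is 4/4."""
    ts_changes = MidiFile(midi_path).time_signature_changes
    return (
        len(ts_changes) == 1
        and ts_changes[0].numerator == 4
        and ts_changes[0].denominator == 4
    )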

You can utilize this notebook for dataset tokenization. However, be mindful that the process can be time-consuming, and you might encounter errors. If you want to skip this process, I uploaded the tokenized dataset to the Hub, and it is ready for you to use! Hugging Face allows you to upload the dataset easily. In my case, I created a data frame to do some cleaning and basic data exploration and uploaded the final data frame as a dataset to the Hub.

# Install datasets
!pip install datasets

# Collect files from the right folder
import glob
dataset_files = glob.glob("/path/to/tokenized/dataset/*.txt")

# Load files as HF dataset
from datasets import load_dataset
dataset = load_dataset("text", data_files=dataset_files)

# Convert dataset to dataframe
ds = dataset["train"]
df = ds.to_pandas()

# Some data cleaning and exploration...

from datasets import Dataset
# Convert the DataFrame to a Hugging Face dataset
clean_dataset = Dataset.from_pandas(df)

# Log in to your Hugging Face account
from huggingface_hub import notebook_login
notebook_login()

# Push dataset to your account, replace juancopi81 with your user
clean_dataset.push_to_hub("juancopi81/mmm_track_lmd_8bars_nots")

Feel free to examine the complete notebook.

Training the Tokenizer and the Model

At this point, your dataset should be formatted into pseudo-words. Remember, you could collect one by following the previous part of the tutorial, or you could use the one prepared on the Hub. You could also use a smaller dataset to test this part of the tutorial or if you are low on resources. For instance, I recommend the js-fakes-4bars dataset as a simpler alternative that will work fine. I'll add links to the respective notebooks based on your chosen dataset (LMD | JS Fake).

Since you now have a dataset of pseudo-words, the next part will be very similar to training a language model, but the language is composed of music words. Indeed, this part of the tutorial heavily follows the Hugging Face NLP course, where you need to train a tokenizer after having a dataset.

Note: If you are unfamiliar with tokenization or model training, I encourage you to review the course to understand this tutorial better.

Tokenizer

Colab for following this part of the tutorial: LMD | JS Fake

For this tutorial, you will be training a GPT-2 model. This model has excellent learning power, is open-source, and Hugging Face has done a great job facilitating its training and usage. But GPT-2 was not trained on a musical language, so you must retrain it from scratch, starting with the tokenizer.

To illustrate the previous point, let's tokenize some words of our dataset with the default GPT-2 tokenizer:

# Take some sample from the dataset
sample_10 = raw_datasets["train"]["text"][10]
sample = sample_10[:242]
sample
PIECE_START  GENRE=POP TRACK_START INST=35 DENSITY=1 BAR_START TIME_DELTA=6.0 NOTE_ON=40 TIME_DELTA=4.0 NOTE_ON=32 TIME_DELTA=0.10833333333333428 NOTE_OFF=40 TIME_DELTA=5.533333333333331 NOTE_OFF=32 BAR_END BAR_START NOTE_ON=31 TIME_DELTA=6.0
# Default GPT-2 tokenizer applied to our dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer(sample).tokens())
['PI', 'EC', 'E', '_', 'ST', 'ART', 'Ġ', 'ĠGEN', 'RE', '=', 'P', 'OP', 'ĠTR', 'ACK', '_', 'ST', 'ART', 'ĠINST', '=', '35', 'ĠD', 'ENS', 'ITY', '=', '1', 'ĠBAR', '_', 'ST', 'ART', 'ĠTIME', '_', 'D', 'EL', 'TA', '=', '6', '.', '0', 'ĠNOTE', '_', 'ON', '=', '40', 'ĠTIME', '_', 'D', 'EL', 'TA', '=', '4', '.', '0', 'ĠNOTE', '_', 'ON', '=', '32', 'ĠTIME', '_', 'D', 'EL', 'TA', '=', '0', '.', '108', '3333', '3333', '33', '34', '28', 'ĠNOTE', '_', 'OFF', '=', '40', 'ĠTIME', '_', 'D', 'EL', 'TA', '=', '5', '.', '5', '3333', '3333', '3333', '31', 'ĠNOTE', '_', 'OFF', '=', '32', 'ĠBAR', '_', 'END', 'ĠBAR', '_', 'ST', 'ART', 'ĠNOTE', '_', 'ON', '=', '31', 'ĠTIME', '_', 'D', 'EL', 'TA', '=', '6', '.', '0']

As seen, the default GPT-2 tokenizer struggles with music tokens. We'll need a custom approach for better results.

For training a tokenizer, you would usually start by normalizing the words. This step includes removing needless whitespace, lowercasing the words, and removing accents. This step, essential for natural languages, is not needed with the music tokens you have.

The next step is pre-tokenization, where you split the inputs into smaller entities, like words. In our case, splitting the inputs on whitespace is enough:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# We need to specify the UNK token
new_tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))

# Add pretokenizer
from tokenizers.pre_tokenizers import WhitespaceSplit

new_tokenizer.pre_tokenizer = WhitespaceSplit()

# Let's test our pre_tokenizer
new_tokenizer.pre_tokenizer.pre_tokenize_str(sample)
[('PIECE_START', (0, 11)),
 ('GENRE=POP', (13, 22)),
 ('TRACK_START', (23, 34)),
 ('INST=35', (35, 42)),
 ('DENSITY=1', (43, 52)),
 ('BAR_START', (53, 62)),
 ('TIME_DELTA=6.0', (63, 77)),
 ('NOTE_ON=40', (78, 88)),
 ('TIME_DELTA=4.0', (89, 103)),
 ('NOTE_ON=32', (104, 114)),
 ('TIME_DELTA=0.10833333333333428', (115, 145)),
 ('NOTE_OFF=40', (146, 157)),
 ('TIME_DELTA=5.533333333333331', (158, 186)),
 ('NOTE_OFF=32', (187, 198)),
 ('BAR_END', (199, 206)),
 ('BAR_START', (207, 216)),
 ('NOTE_ON=31', (217, 227)),
 ('TIME_DELTA=6.0', (228, 242))]

Finally, you train your tokenizer, do any post-processing, and (optionally but highly recommended) upload it to the Hub.

# Yield batches of 1,000 texts
def get_training_corpus():
  dataset = raw_datasets["train"]
  for i in range(0, len(dataset), 1000):
    yield dataset[i : i + 1000]["text"]

from tokenizers.trainers import WordLevelTrainer

# Add special tokens
trainer = WordLevelTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train the tokenizer on the corpus of pseudo-words
new_tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

# Post-processing and uploading it to the Hub
from transformers import PreTrainedTokenizerFast

new_tokenizer.save("tokenizer.json")

new_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
new_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

new_tokenizer.push_to_hub("lmd_8bars_tokenizer")

Let's see how your tokenizer is working after training:
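A call like the following (assuming sample is the same snippet tokenized earlier) should reproduce the output shown below:

print(new_tokenizer(sample).tokens())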

['PIECE_START', 'GENRE=POP', 'TRACK_START', 'INST=35', 'DENSITY=1', 'BAR_START', 'TIME_DELTA=6.0', 'NOTE_ON=40', 'TIME_DELTA=4.0', 'NOTE_ON=32', 'TIME_DELTA=0.10833333333333428', 'NOTE_OFF=40', 'TIME_DELTA=5.533333333333331', 'NOTE_OFF=32', 'BAR_END', 'BAR_START', 'NOTE_ON=31', 'TIME_DELTA=6.0']

Just what we wanted! Fantastic job! You now have a tokenizer in the Hub for training a GPT-2 model.

Model

Colab for following this part of the tutorial: LMD | JS Fake

Now that your dataset and tokenizer are ready, it is time to train the model. In this part of the tutorial, you will:

  • Prepare the dataset for the model.
  • Select the model's configuration.
  • Train the model with a custom trainer. The custom trainer will allow you to log the results of the model while training in Weights and Biases (you need a W&B account for this).

Preparing the Dataset

You've done the hard work already, so preparing your dataset is straightforward. You need to grab your dataset from Hugging Face and use your new tokenizer to create your tokenized dataset. This tokenized version of the dataset is what GPT-2 expects as its input.

# Import libraries
from datasets import load_dataset
from transformers import AutoTokenizer

# Download dataset - you can change it for your own dataset
ds = load_dataset("juancopi81/mmm_track_lmd_8bars_nots", split="train")
# We had only "train" in the ds, so we can create a test and train split
raw_datasets = ds.train_test_split(test_size=0.1, shuffle=True)
# Change for respective tokenizer
tokenizer = AutoTokenizer.from_pretrained("juancopi81/lmd_8bars_tokenizer")
raw_datasets

raw_datasets now contains the train and test split.

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 159810
    })
    test: Dataset({
        features: ['text'],
        num_rows: 17757
    })
})

Let's now tokenize the entire dataset. There are many approaches to doing this. In this tutorial, you will truncate any text (song) longer than your defined context_length. In transformer models, context_length represents the maximum sequence length (tokens) the model can handle. This length is often constrained due to memory considerations and the model's architecture.

# You can replace this, 2048 seems a good number here
context_length = 2048

# Function for tokenizing the dataset
def tokenize(element):
  outputs = tokenizer(
      element["text"],
      truncation=True, # Remove elements longer than the context size; no effect on JS Fake
      max_length=context_length,
      padding=False
  )
  return {"input_ids": outputs["input_ids"]}

# Create tokenized_dataset. We use map to pass each element of our dataset to tokenize and remove unnecessary columns.
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)

tokenized_datasets
DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 159810
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 17757
    })
})

tokenized_datasets has the input_ids you need for training the model.
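As an optional sanity check (not in the original notebook), you can confirm that the tokenized examples respect the context length:

# Optional check: no example should exceed context_length after truncation
for i in range(3):
    print(len(tokenized_datasets["train"][i]["input_ids"]))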

Selecting the Model Configuration

For this tutorial, you will be training a GPT-2 model. You can configure different sizes of a GPT-2 model, which is a critical decision when setting up your model. I added some code in the notebook to determine the model's size using some scaling laws results from the Chinchilla paper (a study that analyzes the relationship between model size, data, and performance). I adapted this part of the notebook from Karpathy's implementation.

Note: I'm currently refining this part of the tutorial, so it's still a work in progress. As I make updates, I'll be refreshing the notebook accordingly. Feedback is always welcome!

For this tutorial, let's use a small version (few parameters) that will allow you to train the model with more constrained resources and, after training, generate music faster. Indeed, the demo you saw at the beginning of the tutorial does not use a GPU and still creates music in a reasonable time.
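Before committing to a configuration, a rough back-of-the-envelope estimate (my own sketch, not part of the original notebook) helps confirm the model stays small; for a standard GPT-2 block, each layer contributes roughly 12 * n_embd^2 parameters:

# Rough GPT-2 parameter estimate (ignores biases and layer norms)
def approx_gpt2_params(n_layer: int, n_embd: int, vocab_size: int, n_positions: int) -> int:
    embeddings = (vocab_size + n_positions) * n_embd  # token + position embeddings
    per_block = 12 * n_embd ** 2                      # ~4*d^2 for attention + ~8*d^2 for the MLP
    return embeddings + n_layer * per_block

# Values match the configuration defined below
print(f"~{approx_gpt2_params(6, 512, len(tokenizer), 2048) / 1e6:.1f}M parameters")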

# Change this based on size of the data
n_layer=6 # Number of transformer layers
n_head=8 # Number of multi-head attention heads
n_emb=512 # Embedding size

from transformers import AutoConfig, GPT2LMHeadModel

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_positions=context_length,
    n_layer=n_layer,
    n_head=n_head,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    n_embd=n_emb
)

model = GPT2LMHeadModel(config)

Data Collator

Before starting training, you need to create the batches for your model. Also, recall that in a causal language model the inputs act as the labels (shifted by one element), so you must take care of that too. But worry not: the data collator from Hugging Face will do just that for us. 🤗 definitely makes our lives easier!

from transformers import DataCollatorForLanguageModeling
# It supports both masked language modeling (MLM) and causal language modeling (CLM)
# We need to set mlm=False for CLM
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
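If you want to see what the collator produces (a quick optional check), feed it a couple of tokenized examples; the labels are a copy of the input_ids, and the causal shift happens inside the model:

# Optional check: collate two tokenized examples into a padded batch
examples = [tokenized_datasets["train"][i] for i in range(2)]
batch = data_collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)  # padded positions get label -100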

Training the Model

You have put all the pieces together, and now it is the moment of truth: training the model! You don't want to fly blind while your model is training, so it is always a good idea to test some generations during the process. This part is gratifying: you will listen to how your AI music evolves as the epochs go by.

To do this, you will need a Weights and Biases account and customize the trainer so it logs music in the eval_loop. Please refer to the notebook for the details, and here you can see the critical snippet:

import note_seq
import wandb
from transformers import Trainer, TrainingArguments

# token_sequence_to_note_sequence is the helper function shown later in the Space section

# First, create a custom trainer that generates and logs an audio sample during evaluation
SAMPLE_RATE=44100
class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def evaluation_loop(
        self,
        dataloader,
        description,
        prediction_loss_only=None,
        ignore_keys=None,
        metric_key_prefix="eval",
    ):
        # call super class method to get the eval outputs
        eval_output = super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only,
            ignore_keys,
            metric_key_prefix,
        )

        # generate a short sample with the current model and log the audio to W&B
        if wandb.run is not None:
            input_ids = tokenizer.encode("PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START", return_tensors="pt").cuda()
            # Generate more tokens.
            voice1_generated_ids = model.generate(
                input_ids,
                max_length=512,
                do_sample=True,
                temperature=0.75,
                eos_token_id=tokenizer.encode("TRACK_END")[0]
            )
            voice2_generated_ids = model.generate(
                voice1_generated_ids,
                max_length=512,
                do_sample=True,
                temperature=0.75,
                eos_token_id=tokenizer.encode("TRACK_END")[0]
            )
            voice3_generated_ids = model.generate(
                voice2_generated_ids,
                max_length=512,
                do_sample=True,
                temperature=0.75,
                eos_token_id=tokenizer.encode("TRACK_END")[0]
            )
            voice4_generated_ids = model.generate(
                voice3_generated_ids,
                max_length=512,
                do_sample=True,
                temperature=0.75,
                eos_token_id=tokenizer.encode("TRACK_END")[0]
            )
            token_sequence = tokenizer.decode(voice4_generated_ids[0])
            note_sequence = token_sequence_to_note_sequence(token_sequence)
            synth = note_seq.fluidsynth
            array_of_floats = synth(note_sequence, sample_rate=SAMPLE_RATE)
            int16_data = note_seq.audio_io.float_samples_to_int16(array_of_floats)
            wandb.log({"Generated_audio": wandb.Audio(int16_data, SAMPLE_RATE)})


        return eval_output

With your custom trainer in place, you can start training the model. As a starting point, I used the following parameters:

# Create the args for our trainer
from argparse import Namespace

# Get the output directory with timestamp.
output_path = "output"
steps = 5000
# Training configuration
config = {"output_dir": output_path,
          "num_train_epochs": 1,
          "per_device_train_batch_size": 8,
          "per_device_eval_batch_size": 4,
          "evaluation_strategy": "steps",
          "save_strategy": "steps",
          "eval_steps": steps,
          "logging_steps":steps,
          "logging_first_step": True,
          "save_total_limit": 5,
          "save_steps": steps,
          "lr_scheduler_type": "cosine",
          "learning_rate":5e-4,
          "warmup_ratio": 0.01,
          "weight_decay": 0.01,
          "seed": 1,
          "load_best_model_at_end": True,
          "report_to": "wandb"}

args = Namespace(**config)

Let's use them in our CustomTrainer:

train_args = TrainingArguments(**config)

# Use the CustomTrainer created above
trainer = CustomTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

And launch your training:

# Train the model.
trainer.train()
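Once training finishes, you will probably want to save the model and push it to the Hub so your Space can load it later. A minimal sketch (the repository name below is only an example; replace it with your own):

# Save the final model locally and (optionally) push it to the Hub
trainer.save_model(output_path)
model.push_to_hub("your-username/lmd-8bars-2048-epochs10")  # illustrative repo name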

Using Sweeps to Find Better Hyperparameters

In the previous section, you trained your music generation model. That's great! Let's now search for better hyperparameters for your model. There are different approaches to doing this; I decided to implement "sweeps" from Weights and Biases due to its user interface and ease of use.

Setting up sweeps in W&B first requires organizing your code. In this step, you will split the previous notebook into a series of functions you can call with different arguments. You can see an example of doing this in the following links:

After organizing the code, you can define your sweep configuration in a YAML file or a Python dictionary. This configuration will explain to W&B the strategy you want to implement for exploring the hyperparameters. Let's explore this file:

# The program to run
program: train.py

# Method can be grid, random or bayes
method: random

# Project this sweep is part of
project: mlops-001-lmdGPT

# Metric to optimize
metric:
  name: eval/loss
  goal: minimize

# Parameters space to search
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 5e-4
    max: 3e-3
  gradient_accumulation_steps:
    values: [1, 2, 4]

For this tutorial, the YAML file is configured to explore hyperparameters for the learning_rate and the gradient_accumulation_steps, two of the most impactful numbers for the performance of your training process. Feel free to experiment with this and share your results!
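For the sweep agent to override these values, train.py must expose them as command-line arguments. The following is only an illustrative sketch of how that might look, not the actual script:

# Illustrative sketch: expose the sweep hyperparameters as CLI flags in train.py
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    return parser.parse_args()

# args = parse_args()  # then forward these values into TrainingArguments(...)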

To run your sweep, follow these steps:

1. Initialize your sweep:

wandb sweep sweep.yaml

2. Start your sweep agent(s): The {wandb agent} value can be taken from the output of the previous step. The {runs for this agent} represents the maximum number of trials the agent should undertake for finding the best hyperparameters:

wandb agent {wandb agent} --count {runs for this agent}

I prepared a notebook for running your agents once your organized code is in GitHub (LMD | JS Fake).

You can then look at the results of your sweeps in your W&B account to conclude your analysis:

Fig: Example of how a sweep looks in W&B.

You can now launch your script for training your model with the best hyperparameters in your analysis. For instance:

python train.py --learning_rate=0.0005 --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --num_train_epochs=10 --push_to_hub=True --eval_steps=4994 --logging_steps=4994 --save_steps=4994 --output_dir="lmd-8bars-2048-epochs10" --gradient_accumulation_steps=2

And with that, we can conclude the part of training your model and tokenizer. I hope you are as excited as I am for what comes next: Showcasing your model 💪🏾.

Showcasing the Model in a 🤗 Space

With your model trained and ready, it's time to show it off! You can create the User Interface (UI) of your model with Gradio and host your app as a Hugging Face Space. In this part of the tutorial, we'll do just that together. Remember, you will need a Hugging Face account for this.

After creating a new space, you can decide which SDK to use. For this tutorial, you will use Docker to have more control over your app's environment. The following is the Dockerfile I added for the ML demo:

FROM ubuntu:20.04

WORKDIR /code

# So users can share results with the community - Adjust to your specific use case
ENV SYSTEM=spaces
ENV SPACE_ID=juancopi81/multitrack-midi-music-generator

COPY ./requirements.txt /code/requirements.txt

# Preconfigure tzdata
RUN DEBIAN_FRONTEND="noninteractive" apt-get -qq update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y tzdata

# Some important packages for playing the generated music
RUN apt-get update -qq && \
    apt-get install -qq python3-pip build-essential libasound2-dev libjack-dev wget cmake pkg-config libglib2.0-dev ffmpeg

# Download libfluidsynth source
RUN wget https://github.com/FluidSynth/fluidsynth/archive/refs/tags/v2.3.3.tar.gz && \
    tar xzf v2.3.3.tar.gz && \
    cd fluidsynth-2.3.3 && \
    mkdir build && \
    cd build && \
    cmake .. && \
    make && \
    make install && \
    cd ../../ && \
    rm -rf fluidsynth-2.3.3 v2.3.3.tar.gz

ENV LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
RUN ldconfig

RUN pip3 install --no-cache-dir --upgrade -r /code/requirements.txt

# Set up a new user named "user" with user ID 1000
RUN useradd -m -u 1000 user

# Switch to the "user" user
USER user

# Set home to the user's home directory
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR $HOME/app

# Copy the current directory contents into the container at $HOME/app setting the owner to the user
COPY --chown=user . $HOME/app

CMD ["python3", "main.py"]

You start from an Ubuntu base image and install the necessary packages, such as FluidSynth, which you will use to synthesize the audio of the generated music. The other essential Python packages are listed in the requirements.txt file; feel free to examine it.

Another critical part of your app is how to go from the tokens generated by the model to the music notes. I've been hiding this function throughout the tutorial, but let's see how it works. Essentially, the function uses Magenta's note_seq library to create a note_sequence that you can then convert to MIDI or play directly. Here is the code for that; the attribution goes entirely to Dr. Tristan Behrens.

from typing import Optional

from note_seq.protobuf.music_pb2 import NoteSequence
from note_seq.constants import STANDARD_PPQ


def token_sequence_to_note_sequence(
    token_sequence: str,
    qpm: float = 120.0,
    use_program: bool = True,
    use_drums: bool = True,
    instrument_mapper: Optional[dict] = None,
    only_piano: bool = False,
) -> NoteSequence:
    """
    Converts a sequence of tokens into a sequence of notes.
    Args:
        token_sequence (str): The sequence of tokens to convert.
        qpm (float, optional): The quarter notes per minute. Defaults to 120.0.
        use_program (bool, optional): Whether to use program. Defaults to True.
        use_drums (bool, optional): Whether to use drums. Defaults to True.
        instrument_mapper (Optional[dict], optional): The instrument mapper. Defaults to None.
        only_piano (bool, optional): Whether to only use piano. Defaults to False.
    Returns:
        NoteSequence: The resulting sequence of notes.
    """
    if isinstance(token_sequence, str):
        token_sequence = token_sequence.split()

    note_sequence = empty_note_sequence(qpm)

    # Compute note and bar lengths based on the provided QPM
    note_length_16th = 0.25 * 60 / qpm
    bar_length = 4.0 * 60 / qpm

    # Render all notes.
    current_program = 1
    current_is_drum = False
    current_instrument = 0
    track_count = 0
    for _, token in enumerate(token_sequence):
        if token == "PIECE_START":
            pass
        elif token == "PIECE_END":
            break
        elif token == "TRACK_START":
            current_bar_index = 0
            track_count += 1
            pass
        elif token == "TRACK_END":
            pass
        elif token == "KEYS_START":
            pass
        elif token == "KEYS_END":
            pass
        elif token.startswith("KEY="):
            pass
        elif token.startswith("INST"):
            instrument = token.split("=")[-1]
            if instrument != "DRUMS" and use_program:
                if instrument_mapper is not None:
                    if instrument in instrument_mapper:
                        instrument = instrument_mapper[instrument]
                current_program = int(instrument)
                current_instrument = track_count
                current_is_drum = False
            if instrument == "DRUMS" and use_drums:
                current_instrument = 0
                current_program = 0
                current_is_drum = True
        elif token == "BAR_START":
            current_time = current_bar_index * bar_length
            current_notes = {}
        elif token == "BAR_END":
            current_bar_index += 1
            pass
        elif token.startswith("NOTE_ON"):
            pitch = int(token.split("=")[-1])
            note = note_sequence.notes.add()
            note.start_time = current_time
            note.end_time = current_time + 4 * note_length_16th
            note.pitch = pitch
            note.instrument = current_instrument
            note.program = current_program
            note.velocity = 80
            note.is_drum = current_is_drum
            current_notes[pitch] = note
        elif token.startswith("NOTE_OFF"):
            pitch = int(token.split("=")[-1])
            if pitch in current_notes:
                note = current_notes[pitch]
                note.end_time = current_time
        elif token.startswith("TIME_DELTA"):
            delta = float(token.split("=")[-1]) * note_length_16th
            current_time += delta
        elif token.startswith("DENSITY="):
            pass
        elif token == "[PAD]":
            pass
        else:
            pass

    # Make the instruments right.
    instruments_drums = []
    for note in note_sequence.notes:
        pair = [note.program, note.is_drum]
        if pair not in instruments_drums:
            instruments_drums += [pair]
        note.instrument = instruments_drums.index(pair)

    if only_piano:
        for note in note_sequence.notes:
            if not note.is_drum:
                note.instrument = 0
                note.program = 0

    return note_sequence


def empty_note_sequence(qpm: float = 120.0, total_time: float = 0.0) -> NoteSequence:
    """
    Creates an empty note sequence.
    Args:
        qpm (float, optional): The quarter notes per minute. Defaults to 120.0.
        total_time (float, optional): The total time. Defaults to 0.0.
    Returns:
        NoteSequence: The empty note sequence.
    """
    note_sequence = NoteSequence()
    note_sequence.tempos.add().qpm = qpm
    note_sequence.ticks_per_quarter = STANDARD_PPQ
    note_sequence.total_time = total_time
    return note_sequence

In the utils.py file, you can find the function that handles the model's generation. I decided to generate one instrument at a time so users can have more control over the music synthesis:

def generate_new_instrument(seed: str, temp: float = 0.75) -> str:
    """
    Generates a new instrument sequence from a given seed and temperature.
    Args:
        seed (str): The seed string for the generation.
        temp (float, optional): The temperature for the generation, which controls the randomness. Defaults to 0.75.
    Returns:
        str: The generated instrument sequence.
    """
    seed_length = len(tokenizer.encode(seed))

    while True:
        # Encode the conditioning tokens.
        input_ids = tokenizer.encode(seed, return_tensors="pt")

        # Move the input_ids tensor to the same device as the model
        input_ids = input_ids.to(model.device)

        # Generate more tokens.
        eos_token_id = tokenizer.encode("TRACK_END")[0]
        generated_ids = model.generate(
            input_ids,
            max_new_tokens=2048,
            do_sample=True,
            temperature=temp,
            eos_token_id=eos_token_id,
        )
        generated_sequence = tokenizer.decode(generated_ids[0])

        # Check if the generated sequence contains "NOTE_ON" beyond the seed
        new_generated_sequence = tokenizer.decode(generated_ids[0][seed_length:])
        if "NOTE_ON" in new_generated_sequence:
            # If the new material contains a NOTE_ON, return it (we generate one instrument at a time)
            return generated_sequence

This utils file also contains the code to remove, change, or regenerate an instrument, among other vital processes.
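As an illustration of how such operations might work on the token string (a simplified sketch, not the exact utils.py code), removing the last instrument amounts to dropping everything after the second-to-last TRACK_END token:

def remove_last_instrument_sketch(token_sequence: str) -> str:
    # Split the piece into tracks at the TRACK_END markers and drop the last track
    parts = token_sequence.split("TRACK_END")
    if len(parts) <= 2:  # only one track (plus the trailing remainder): nothing to remove
        return token_sequence
    return "TRACK_END".join(parts[:-2]) + "TRACK_END"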

Finally, in the main.py file, you add the buttons that users can click to interact with the model.

# Code snippet of clickable buttons
def run():
    with demo:
        gr.HTML(DESCRIPTION)
        gr.DuplicateButton(value="Duplicate Space for private use")
        with gr.Row():
            with gr.Column():
                temp = gr.Slider(
                    minimum=0, maximum=1, step=0.05, value=0.85, label="Temperature"
                )
                genre = gr.Dropdown(
                    choices=genres, value="POP", label="Select the genre"
                )
                with gr.Row():
                    btn_from_scratch = gr.Button("🧹 Start from scratch")
                    btn_continue = gr.Button("➡️ Continue Generation")
                    btn_remove_last = gr.Button("↩️ Remove last instrument")
                    btn_regenerate_last = gr.Button("🔄 Regenerate last instrument")
            with gr.Column():
                with gr.Box():
                    audio_output = gr.Video(show_share_button=True)
                    midi_file = gr.File()
                    with gr.Row():
                        qpm = gr.Slider(
                            minimum=60, maximum=140, step=10, value=120, label="Tempo"
                        )
                        btn_qpm = gr.Button("Change Tempo")
        with gr.Row():
            with gr.Column():
                plot_output = gr.Plot()
            with gr.Column():
                instruments_output = gr.Markdown("# List of generated instruments")
        with gr.Row():
            text_sequence = gr.Text()
            empty_sequence = gr.Text(visible=False)
        with gr.Row():
            num_tokens = gr.Text(visible=False)
        btn_from_scratch.click(
            fn=generate_song,
            inputs=[genre, temp, empty_sequence, qpm],
            outputs=[
                audio_output,
                midi_file,
                plot_output,
                instruments_output,
                text_sequence,
                num_tokens,
            ],
        )
        btn_continue.click(
            fn=generate_song,
            inputs=[genre, temp, text_sequence, qpm],
            outputs=[
                audio_output,
                midi_file,
                plot_output,
                instruments_output,
                text_sequence,
                num_tokens,
            ],
        )
        btn_remove_last.click(
            fn=remove_last_instrument,
            inputs=[text_sequence, qpm],
            outputs=[
                audio_output,
                midi_file,
                plot_output,
                instruments_output,
                text_sequence,
                num_tokens,
            ],
        )
        btn_regenerate_last.click(
            fn=regenerate_last_instrument,
            inputs=[text_sequence, qpm],
            outputs=[
                audio_output,
                midi_file,
                plot_output,
                instruments_output,
                text_sequence,
                num_tokens,
            ],
        )
        btn_qpm.click(
            fn=change_tempo,
            inputs=[text_sequence, qpm],
            outputs=[
                audio_output,
                midi_file,
                plot_output,
                instruments_output,
                text_sequence,
                num_tokens,
            ],
        )

    demo.launch(server_name="0.0.0.0", server_port=7860)

Let's see how the buttons look in the interface:

UI to interact with the music model
User interface showcasing various buttons for interacting with the music generation model.

And you can now share your model with everyone. Having this great model and sharing it with the world is cool, but it is even cooler if you consider the broader impacts. Let's think about that together in the next section.

Considering Ethical Implications

First, thank you for making it this far in the tutorial. It is a lengthy tutorial and could be an intimidating one. While I've put in my best efforts to ensure the accuracy and quality of this tutorial, I acknowledge that there might be areas of improvement or potential errors. I'm continuously learning and growing, and I appreciate any feedback or suggestions to enhance the content. Your insights will benefit future readers and contribute to my learning journey. I would be thrilled if even just one part of this tutorial aids your learning process.

On the other hand, since I started the tutorial, I decided to include some thoughts about the ethical implications. I am not an expert on this topic, and I encourage you to seek out the insights and perspectives of experts in the field. Nevertheless, I wanted to share some of my understanding and concerns.

There are many things to consider about generating music with AI:

  • What is the role of the system in the creative process?
  • What impact could these models have on the labor market of musicians?
  • Are we respecting the rights of the artists who created the music we use to train the models?
  • Who is the owner of the generated music?

The list goes on. It would be impossible to cover all these questions, so I'd like to focus on one aspect that especially concerns me: The digital divide.

The digital divide "is the unequal access to digital technology" (source: Wikipedia), which creates a dangerous gap between those who have access to information and resources and those who don't.

What about the digital divide in the music realm?

Music is the universal language of mankind - Henry Wadsworth Longfellow

Music is a universal language that transcends borders, cultures, and epochs. It exists in every civilization. Still, vibrant traditions are being overshadowed and even forgotten in favor of mainstream music partly because certain groups have more access to digital platforms and music creation and distribution tools.

Machine Learning, notably when democratized, could be a tool to preserve and integrate underrepresented music in our days. Indeed, with suitable datasets, machine learning models could analyze, generate, and classify marginalized music, among other tasks. But it could also amplify and perpetuate biases, as is the case now with most of us training models to generate Rock, Jazz, or Classical European Music. In fact, the community has come up with the term Bach Faucet since many models can now synthesize music almost identical to Bach: "A Bach Faucet is a situation where a generative system makes an endless supply of some content at or above the quality of some culturally-valued original, but the endless supply of it makes it no longer rare, and thus less valuable" (via Twitter).

Besides, incorporating other artistic traditions could only enhance the final models and enrich the resulting music. Innovative artists are demonstrating this potential, like Hexorsismos or Yaboi Hanoi, who won the 2022 AI Song Contest with melodies and sound designs inspired by Thai culture.

There are many challenges to having a more diverse representation of music in AI, including data availability, investment, education, and others. The open-source community is uniquely positioned to address some of these challenges by collaborating on creating more diverse datasets, developing inclusive models, sharing free tutorials, or, in general, designing tools that honor a broader range of traditions.

Beyond music, Machine Learning is shaping an entirely new reality - it is the new electricity, as Prof. Andrew Ng presents it. We might be approaching an AI revolution that could change our world, and we must all have a voice and participate in this trajectory, no matter our language, culture, ethnicity, education, or nationality. Think about the risks of a robust tool controlled by a few influential individuals or groups.

I invite you to participate in more inclusive AI progress actively. As a concrete example, you can join the open-source community, no matter your level of expertise. Every contribution counts, and the combined forces of motivated people could do wonders. You can find many opportunities to collaborate at any level of knowledge in the Hugging Face discord. Finally, being from Colombia, I'd like to start a quality dataset (MIDI or Audio) in Hugging Face with under-represented music from Latin America. Let me know if you want to join forces 💪🏾.