Training was done through Google Colab's free tier, using a single 15 GB Tesla T4 GPU.
Training was logged through Weights & Biases.
A link to the full training notebook can be found [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=34kpyWSnaJE1).

### Training hyperparameters

The following hyperparameters were used during training:

[...]

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0

<hr/>

The sections below this point serve as a user guide for the Hugging Face Space found [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

<hr/>

# Introduction

Midi_space2 lets the user generate a four-bar musical progression and listen back to it.
There are two sections to interact with: audio generation and token generation.

- Audio generation contains 3 sliders:
  - Inst number: a value that adjusts the tonality of the sound.
  - Note number: a value that adjusts the reference pitch the sound is generated from.
  - BPM: the beats per minute, or the speed of the sound.

- Token generation is a secondary function that lets the user see what the language model generated.
  - Please note that this section will display an "error" if used before any audio is generated.
  - This section shows the tokens responsible for the audio you hear in the audio generation section.

## Usage

To run the demo, click the link [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

The demo will default to the "audio generation" tab. Here you will find the 3 sliders you can interact with:

- Inst number
- Note number
- BPM

When you have selected the values you want to try, click the "generate audio" button at the bottom.
When your audio is ready, you will see the audio waveform displayed within the "audio" box, found above the sliders.

**Note**

Due to how audio is handled in Google Chrome, you may have to generate the audio a few times the first time you use this demo.

Additionally, you can select the "Token Generation" tab and click the "show generated tokens" button there to see the raw text data.

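The raw text shown in the "Token Generation" tab is the output of a causal language model `generate` call along these lines. This is only a hedged sketch with transformers: the model repo ID below is a placeholder assumption, and the sampling settings are illustrative rather than the Space's actual configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Katpeeler/midi-gpt2"  # placeholder ID -- substitute the actual fine-tuned model repo
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Prompt with the start-of-piece token and let the model continue the sequence.
inputs = tokenizer("PIECE_START", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=512, do_sample=True, temperature=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
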
## Documentation

You can view the Google Colab notebook used for training [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=qWq2lY0vWFTD).

- The demo is currently hosted as a Gradio application on Hugging Face Spaces.
- For audio to be heard, we use the soundfile package.

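For reference, here is a minimal sketch of how a two-tab Gradio layout with the controls described above could be wired up. The handler bodies, slider ranges, and default values are assumptions for illustration, not the Space's actual app.py.

```python
import numpy as np
import gradio as gr

SAMPLE_RATE = 44100

# Placeholder handlers: the real app generates tokens with the GPT-2 model and
# renders them to audio with note_seq. These stubs only illustrate the wiring.
def generate_audio(inst_number, note_number, bpm):
    # Stand-in signal: one bar of a sine tone at the requested reference pitch.
    seconds = 4 * 60.0 / bpm
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    freq = 440.0 * 2 ** ((note_number - 69) / 12)  # MIDI note number -> Hz
    return SAMPLE_RATE, (0.2 * np.sin(2 * np.pi * freq * t)).astype(np.float32)

def show_tokens():
    return "PIECE_START ..."  # the real handler returns the model's token string

with gr.Blocks() as demo:
    with gr.Tab("audio generation"):
        audio_out = gr.Audio(label="audio")
        inst = gr.Slider(0, 127, value=48, step=1, label="Inst number")
        note = gr.Slider(0, 127, value=60, step=1, label="Note number")
        bpm = gr.Slider(60, 200, value=120, step=1, label="BPM")
        gr.Button("generate audio").click(generate_audio, [inst, note, bpm], audio_out)
    with gr.Tab("Token Generation"):
        tokens_out = gr.Textbox(label="generated tokens")
        gr.Button("show generated tokens").click(show_tokens, None, tokens_out)

demo.launch()
```
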
The core components are this gpt2 model, the [js-fakes-4bars dataset](https://huggingface.co/datasets/TristanBehrens/js-fakes-4bars), and [note-seq](https://github.com/magenta/note-seq).
The dataset was created by [Tristan Behrens](https://huggingface.co/TristanBehrens) and is relatively small.
This made it perfect for training a gpt2 model through the free tier of Google Colab. I selected this dataset after finding a different dataset on Hugging Face, [mmm_track_lmd_8bars_nots](https://huggingface.co/datasets/juancopi81/mmm_track_lmd_8bars_nots).
I initially used that dataset, but ran out of free-tier compute resources about 3 hours into training. This setback ultimately made me decide to use a smaller dataset for the time being.

- Js-fakes dataset size: 13.7 MB, 4,479 rows (the one I actually used)
- Juancopi81 dataset size: 490 MB, 177,567 rows (the one I attempted to use first)

For the remainder of this post, we will only discuss the js-fakes dataset.

After downloading, the training split contained 3,614 rows and the test split contained 402 rows. Each entry follows this format:

PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=70 TIME_DELTA=2 NOTE_OFF=70 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=70 TIME_DELTA=4 NOTE_OFF=70 NOTE_ON=69 TIME_DELTA=2 NOTE

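If you want to inspect the data yourself, a quick sketch along these lines should reproduce the splits and an entry like the one above. It assumes the `datasets` library and the dataset ID linked earlier; the column name is an assumption.

```python
from datasets import load_dataset

# The js-fakes-4bars dataset linked above; splits should be roughly 3,614 train / 402 test rows.
ds = load_dataset("TristanBehrens/js-fakes-4bars")
print(ds)
print(ds["train"][0]["text"][:200])  # assumes the token string lives in a "text" column
```
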
This data is in a very specific tokenized format, representing the information that is relevant to playing a note. Of note:
|
190 |
+
|
191 |
+
- NOTE_ON=## : represents the start of a musical note, and which note to play, (A, B, C, etc.)
|
192 |
+
- TIME_DELTA=4 : represents a quarter note. A half note is represented by TIME_DELTA=8, and an eigth note would be represented by TIME_DELTA=2.
|
193 |
+
- NOTE_OFF=## : represents the end of a musical note, and which note to end.
|
194 |
+
|
195 |
+
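Since TIME_DELTA appears to count sixteenth-note steps (4 = quarter, 8 = half, 2 = eighth), the duration of one step in seconds follows directly from the BPM. The helper below is illustrative only, not code from the Space.

```python
def time_delta_to_seconds(delta: int, bpm: float) -> float:
    """Convert a TIME_DELTA value (in sixteenth-note steps) to seconds at a given tempo."""
    seconds_per_quarter = 60.0 / bpm   # one beat (quarter note)
    seconds_per_16th = seconds_per_quarter / 4
    return delta * seconds_per_16th

# Example: TIME_DELTA=4 (a quarter note) at 120 BPM lasts 0.5 seconds.
print(time_delta_to_seconds(4, 120))  # 0.5
```
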
These text-based tokens contain the necessary information to create MIDI, a standard format for synthesized music data.
The dataset has already been converted between MIDI files and this text-based format.
This format is called "MMM", or Multi-Track Music Machine, proposed in the paper found [here](https://arxiv.org/abs/2008.06048).

**Note**

I created a tokenizer for this task and uploaded it to my HuggingFace profile. However, I ended up using the auto-tokenizer from the fine-tuned model, so I won't be exploring that further.

I used Tristan Behrens' js-fakes-4bars tokenizer to tokenize the dataset for training. I selected a context length of 512 and truncated all text longer than that. This helped with using limited resources.

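A minimal sketch of that tokenization step, assuming the tokenizer is loaded with AutoTokenizer; the repo ID and the column name are placeholders, not necessarily the exact ones used.

```python
from transformers import AutoTokenizer

# Placeholder repo ID: point this at the js-fakes-4bars tokenizer (or the fine-tuned model repo).
TOKENIZER_ID = "TristanBehrens/js-fakes-4bars"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)

def tokenize(batch):
    # Truncate to a 512-token context to keep memory usage manageable on the free-tier GPU.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Applied to the dataset loaded earlier, e.g.:
# tokenized = ds.map(tokenize, batched=True)
```
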
The GPT-2 model used has 19.2M parameters. It was trained in steps of 300, through 10 epochs. This is the third iteration of models, and you can find the first two on my HuggingFace profile.
I ended up using a batch size of 4 to further reduce the VRAM requirements in Google Colab. Specifics for the training can be found at the top of this page, but some fun things to note are:

- Total training runtime: around 13 minutes
- Training samples per second: 45.91
- Training steps per second: 11.484
- Average GPU power draw: 66 W
- Average GPU temperature: 77 °C

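A sketch of what that Trainer setup might look like with transformers. Only the batch size and epoch count reflect values stated above; the model dimensions, output path, logging interval, and everything else are assumptions, not the actual run configuration.

```python
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

# A small GPT-2 configuration in the spirit of the ~19.2M-parameter model described
# above; these dimensions are assumptions, not the real config.
config = GPT2Config(vocab_size=len(tokenizer), n_positions=512,
                    n_embd=512, n_layer=6, n_head=8)
model = GPT2LMHeadModel(config)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # builds causal-LM labels

training_args = TrainingArguments(
    output_dir="gpt2-jsfakes",         # placeholder output path
    per_device_train_batch_size=4,     # keeps VRAM usage low on the free-tier T4
    num_train_epochs=10,
    logging_steps=300,                 # a guess at how the "steps of 300" were configured
    report_to="wandb",                 # training was logged with Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # tokenized splits from the previous sketch
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()
```
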
I think it's important to note the power draw of the GPUs used for training any model as we enter this modern era of normalizing this technology.
I obtained those values through [Weights & Biases](https://wandb.ai/site), which I ran alongside my training.
The training method used is outlined in a blog post by Juancopi81 [here](https://huggingface.co/blog/juancopi81/using-hugging-face-to-train-a-gpt-2-model-for-musi#showcasing-the-model-in-a-%F0%9F%A4%97-space).
While I didn't follow that post exactly, it was a great help when learning how to do this.

The final component to talk about is Magenta's note_seq library. This is how token sequences are converted into note sequences and played.
This library is much more powerful than my current use of it, and I plan on expanding this project in the future to incorporate more features.
The main method call for this can be found in the app.py file on the HuggingFace space, but here is a snippet of the code for NOTE_ON:

    elif token.startswith("NOTE_ON"):
        # Parse the MIDI pitch from the token, e.g. "NOTE_ON=70" -> 70.
        pitch = int(token.split("=")[-1])
        # Open a new note in the note_seq NoteSequence at the current position.
        note = note_sequence.notes.add()
        note.start_time = current_time
        note.end_time = current_time + 4 * note_length_16th
        note.pitch = pitch
        note.instrument = current_instrument
        note.program = current_program
        note.velocity = 80
        note.is_drum = current_is_drum
        # Track the open note so a later NOTE_OFF token can close it.
        current_notes[pitch] = note

In short, there are instructions for each type of token in the vocabulary, and once you identify what something is supposed to be, it can easily be mapped to do whatever we want! Pretty cool, and it supports as many instruments as you want.
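
To close the loop from a decoded note sequence to something you can hear, a sketch along these lines would synthesize the sequence and write a WAV file. It uses note_seq's simple built-in synthesizer for illustration; the actual Space may render audio differently, and the hand-built NoteSequence here stands in for one produced by the token loop above.

```python
import soundfile as sf
from note_seq import midi_synth
from note_seq.protobuf import music_pb2

SAMPLE_RATE = 44100

# Build a tiny NoteSequence by hand; in the Space, the sequence comes from the
# token-decoding loop shown above.
note_sequence = music_pb2.NoteSequence()
note = note_sequence.notes.add()
note.pitch = 70        # the NOTE_ON=70 from the example entry
note.start_time = 0.0
note.end_time = 0.5    # TIME_DELTA=4 (a quarter note) at 120 BPM
note.velocity = 80
note_sequence.total_time = note.end_time

# note_seq's basic waveform synthesizer (no fluidsynth or soundfont required).
samples = midi_synth.synthesize(note_sequence, SAMPLE_RATE)

# Write the rendered audio with the soundfile package mentioned above.
sf.write("generated.wav", samples, SAMPLE_RATE)
```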