---
license: apache-2.0
inference: false
pipeline_tag: audio-to-audio
---

# Perceiver AR symbolic audio model

This model is a [Perceiver AR](https://arxiv.org/abs/2202.07765) symbolic audio model (134M parameters) pretrained on
the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset for 27 epochs (157M tokens). It uses [rotary embedding](https://arxiv.org/abs/2104.09864)
for relative position encoding. It is a [training example](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#giantmidi-piano)
of the [perceiver-io](https://github.com/krasserm/perceiver-io) library.
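
Rotary embedding encodes position by rotating each query/key feature pair by a position-dependent angle, so that
attention scores depend only on relative offsets. The sketch below is illustrative only and is not the library's
implementation (which may use a different pairing or frequency convention):

```python
import torch

def apply_rotary(x, base=10000):
    """Apply rotary position embedding to query or key vectors of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # one frequency per feature pair
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # rotate each (x1, x2) pair by its position-dependent angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = apply_rotary(torch.randn(11, 64))  # e.g. queries for an input sequence of length 11
```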

## Model description

Perceiver AR is a simple extension of a plain decoder-only transformer such as GPT-2. A core building block of both is
the *decoder layer*, consisting of a self-attention layer followed by a position-wise MLP. Self-attention uses a causal
attention mask.

Perceiver AR additionally cross-attends to a longer prefix of the input sequence in its first attention layer. This layer
is a hybrid self- and cross-attention layer: self-attention is over the last n positions of the input sequence, with a
causal attention mask, while cross-attention is from the last n positions to the first m positions. The length of the
input sequence is m + n. This allows a Perceiver AR to process a much larger context than decoder-only transformers,
which are based on self-attention only.

<p align="center">
    <img src="https://krasserm.github.io/img/2023-01-23/perceiver-ar.png" alt="Perceiver AR" width="600"/><br/>
    <i>Fig. 1</i>. Attention in Perceiver AR with m=8 prefix tokens and n=3 latent tokens.
</p>
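
The combined attention mask of the hybrid layer can be sketched as follows, using the values from Fig. 1 (m=8 prefix
tokens, n=3 latent tokens). This is an illustration only, not the library's actual code; `True` means "may attend":

```python
import torch

m, n = 8, 3                                               # prefix and latent lengths from Fig. 1
mask = torch.zeros(n, m + n, dtype=torch.bool)            # one row per latent (query) position
mask[:, :m] = True                                        # cross-attention: every latent sees the full prefix
mask[:, m:] = torch.ones(n, n, dtype=torch.bool).tril()   # self-attention: causal over the last n positions

print(mask.int())
```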

The outputs of the hybrid attention layer are n latent arrays corresponding to the last n tokens of the input sequence.
These are further processed by a stack of L-1 decoder layers, where L is the total number of attention layers. A final
layer (not shown in Fig. 1) predicts the target token for each latent position. The weights of the final layer are
shared with the input embedding layer. Except for the initial cross-attention to the prefix sequence, a Perceiver AR
is architecturally identical to a decoder-only transformer.
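
This layer structure can be summarized with the following simplified, self-contained sketch (single-head attention, no
MLPs, layer norms or rotary embedding; the sizes are illustrative and not the model's actual configuration, and the
perceiver-io implementation differs in detail):

```python
import torch
import torch.nn as nn

vocab_size, dim, m, n, L = 389, 64, 8, 3, 4          # illustrative sizes only

embed = nn.Embedding(vocab_size, dim)
to_logits = nn.Linear(dim, vocab_size, bias=False)
to_logits.weight = embed.weight                      # final layer shares weights with the input embedding

def attend(q, k, v, mask):
    scores = (q @ k.transpose(-2, -1)) / dim ** 0.5
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

x = embed(torch.randint(vocab_size, (1, m + n)))     # embedded input sequence of length m + n

# hybrid layer: queries are the last n positions, keys/values are all m + n positions
hybrid_mask = torch.ones(n, m + n, dtype=torch.bool)
hybrid_mask[:, m:] = torch.ones(n, n, dtype=torch.bool).tril()
latents = attend(x[:, -n:], x, x, hybrid_mask)       # (1, n, dim)

# L-1 further decoder layers with causal self-attention over the n latents only
causal_mask = torch.ones(n, n, dtype=torch.bool).tril()
for _ in range(L - 1):
    latents = latents + attend(latents, latents, latents, causal_mask)

logits = to_logits(latents)                          # next-token logits per latent position: (1, n, vocab_size)
```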

## Model training

The model was [trained](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#giantmidi-piano) on
the task of symbolic audio modeling on the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset
for 27 epochs (157M tokens). This dataset consists of [MIDI](https://en.wikipedia.org/wiki/MIDI) files, tokenized using the
approach from the [Perceiver AR paper](https://arxiv.org/pdf/2202.07765.pdf), which is described
in detail in Section A.2 of [Huang et al. (2019)](https://arxiv.org/abs/1809.04281).
All hyperparameters are summarized in the [training script](https://github.com/krasserm/perceiver-io/blob/main/examples/training/sam/giantmidi/train.sh).
The context length was set to 6144 tokens with 2048 latent positions, resulting in a maximal prefix length of 4096. The
actual prefix length per example was randomly chosen between 0 and 4096. Training was done with [PyTorch Lightning](https://www.pytorchlightning.ai/index.html)
and the resulting checkpoint was converted to this 🤗 model with a library-specific [conversion utility](#checkpoint-conversion).
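
One plausible way to form a training example under this scheme is sketched below; this only illustrates the
prefix/latent split described above and is not the library's actual data pipeline:

```python
import torch

num_latents, max_prefix_len = 2048, 4096                     # context length 6144 = 4096 + 2048
tokens = torch.randint(0, 389, (200_000,))                   # a tokenized MIDI file (dummy data)

prefix_len = int(torch.randint(0, max_prefix_len + 1, ()))   # prefix length drawn per example
start = int(torch.randint(0, len(tokens) - (prefix_len + num_latents), ()))

example = tokens[start : start + prefix_len + num_latents]   # prefix followed by num_latents latent positions
# the model predicts (and the loss is computed on) the last num_latents positions only
```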

## Intended use and limitations

This model can be used for audio generation with a user-defined initial number of latent tokens. It mainly exists to
demonstrate how to train Perceiver AR models with the [perceiver-io library](https://github.com/krasserm/perceiver-io).
To improve the quality of the generated audio samples, a much larger dataset than
[GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) is required for training.

## Usage examples

To use this model you first need to [install](https://github.com/krasserm/perceiver-io/blob/main/README.md#installation)
the `perceiver-io` library with the `audio` extension:

```shell
pip install perceiver-io[audio]
```

The model can then be used with PyTorch in three ways. Use the model directly to generate MIDI files:

```python
import torch

from perceiver.model.audio.symbolic import PerceiverSymbolicAudioModel
from perceiver.data.audio.midi_processor import decode_midi, encode_midi
from pretty_midi import PrettyMIDI

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

model = PerceiverSymbolicAudioModel.from_pretrained(repo_id)

prompt_midi = PrettyMIDI("prompt.mid")
prompt = torch.tensor(encode_midi(prompt_midi)).unsqueeze(0)

output = model.generate(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0)

output_midi = decode_midi(output[0].cpu().numpy())
type(output_midi)
```
```
pretty_midi.pretty_midi.PrettyMIDI
```
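
The returned `output_midi` is a `pretty_midi.PrettyMIDI` object and can be written directly to a MIDI file (the
filename below is just an example):

```python
output_midi.write("generated_audio.mid")
```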

Alternatively, use a `symbolic-audio-generation` pipeline to generate MIDI output:

```python
from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)

output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0)
type(output["generated_audio_midi"])
```
```
pretty_midi.pretty_midi.PrettyMIDI
```

Or render the generated MIDI symbols to WAV output with [fluidsynth](https://www.fluidsynth.org/) (note: fluidsynth must
be installed for the following example to work):

```python
from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)

output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0, render=True)

with open("generated_audio.wav", "wb") as f:
    f.write(output["generated_audio_wav"])
```
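
fluidsynth is available from most package managers; on Debian/Ubuntu, for example, it can usually be installed with:

```shell
sudo apt-get install fluidsynth
```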

## Audio samples

The following (hand-picked) audio samples were generated using various prompts from the validation subset of
the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset. The input prompts are
not included in the audio output.

<table>
  <tr>
    <th>Audio sample</th>
    <th>Top-K</th>
    <th>Top-p</th>
    <th>Temperature</th>
    <th>Prefix length</th>
    <th>Latents</th>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="https://martin-krasser.com/perceiver/data/midi/01_nehrlich_continuation.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td style="vertical-align: top;">-</td>
    <td style="vertical-align: top;">0.95</td>
    <td style="vertical-align: top;">0.95</td>
    <td style="vertical-align: top;">4096</td>
    <td style="vertical-align: top;">1</td>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="https://martin-krasser.com/perceiver/data/midi/02_eduardo_continuation.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td style="vertical-align: top;">-</td>
    <td style="vertical-align: top;">0.95</td>
    <td style="vertical-align: top;">1.0</td>
    <td style="vertical-align: top;">4096</td>
    <td style="vertical-align: top;">64</td>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="https://martin-krasser.com/perceiver/data/midi/03_membree_continuation.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td style="vertical-align: top;">-</td>
    <td style="vertical-align: top;">0.95</td>
    <td style="vertical-align: top;">1.0</td>
    <td style="vertical-align: top;">1024</td>
    <td style="vertical-align: top;">1</td>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="https://martin-krasser.com/perceiver/data/midi/04_membree_continuation.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td style="vertical-align: top;">15</td>
    <td style="vertical-align: top;">-</td>
    <td style="vertical-align: top;">1.0</td>
    <td style="vertical-align: top;">4096</td>
    <td style="vertical-align: top;">16</td>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="https://martin-krasser.com/perceiver/data/midi/05_kinscella_continuation.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td style="vertical-align: top;">-</td>
    <td style="vertical-align: top;">0.95</td>
    <td style="vertical-align: top;">1.0</td>
    <td style="vertical-align: top;">4096</td>
    <td style="vertical-align: top;">1</td>
  </tr>
</table>

## Checkpoint conversion

The `krasserm/perceiver-ar-sam-giant-midi` model has been created from a training checkpoint with:

```python
from perceiver.model.audio.symbolic import convert_checkpoint

convert_checkpoint(
    save_dir="krasserm/perceiver-ar-sam-giant-midi",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/sam/version_1/checkpoints/epoch=027-val_loss=1.944.ckpt",
    push_to_hub=True,
)
```

## Citation

```bibtex
@inproceedings{hawthorne2022general,
  title={General-purpose, long-context autoregressive modeling with perceiver ar},
  author={Hawthorne, Curtis and Jaegle, Andrew and Cangea, C{\u{a}}t{\u{a}}lina and Borgeaud, Sebastian and Nash, Charlie and Malinowski, Mateusz and Dieleman, Sander and Vinyals, Oriol and Botvinick, Matthew and Simon, Ian and others},
  booktitle={International Conference on Machine Learning},
  pages={8535--8558},
  year={2022},
  organization={PMLR}
}
```