Expressive Jazz Piano Performance Modeling

Trained checkpoints for jazz piano performance modeling from two model families: two PerformanceRNN LSTM variants pretrained on Aria-MIDI, and two fine-tunes of the publicly released 1B-parameter Aria piano language model (Bradshaw & Colton 2025). All four models are fine-tuned on PiJAMA (Edwards et al. 2024). The code repository is https://github.com/napanto/jazz-piano-performance-modeling. The Aria fine-tunes also drive a real-time macOS application: https://github.com/napanto/aria-realtime-studio.

Models in this repo

The model names below match the headline table of the report.

`aria-full-quality/` — Aria fine-tune, full-quality (offline)

Aria 1B-parameter LLaMA-3.2-style decoder fine-tuned on the PiJAMA hawthorne split with the default 17 727-id AbsTokenizer (no sustain pedal). Architecture: medium (d=1536, 16 layers, 24 heads, RoPE, GQA, max_seq_len=8192). Best swept mean OA on the test split: 0.911. FMD vs the kong test-pool reference (CLaMP-2 encoder): 272.6.

tested.safetensors — the checkpoint reported on in the paper (4-stage train/val pipeline: retrained on TRAIN+VAL for the patience-selected epoch count, evaluated once on TEST).
deployed.safetensors — full retrain on TRAIN+VAL+TEST for the same epoch count. For deployment / listening only; test metrics are not honestly reportable on this checkpoint because it has seen the test set.

`aria-real-time/` — Aria fine-tune, real-time MLX-compatible

Same backbone, but initialised from the public model-demo.safetensors checkpoint with the residual-stream embedding-projection layer preserved (medium-emb architecture, +1536×512 emb_proj). Fine-tuned on the PiJAMA kong album-aware split with the 2 675-id demo tokenizer, which adds explicit sustain-pedal events. Drop-in jazz replacement for the weights used by the upstream aria/demo/demo_mlx.py real-time sampler. Best swept mean OA: 0.804. FMD: 233.6.

tested.safetensors, deployed.safetensors — same conventions as above.

`lstm-hawthorne/` — pretrained PerformanceRNN, hawthorne split

3-layer stacked LSTM (hidden 512, embed 512, tied I/O head, 6.46M params; paper-faithful PerformanceRNN, Oore et al. 2018). Pretrained on the 820 944-file Aria-MIDI corpus (30k steps) and fine-tuned on the PiJAMA hawthorne split with the 413-id no-pedal vocabulary. Best swept mean OA: 0.664. FMD: 427.8.

tested.pt — Stage-B equivalent (the one reported on).

`lstm-kong-pedal/` — pretrained PerformanceRNN, kong+pedal split

Same architecture but with the 314-id pedal-aware vocabulary (NOTE_ON×88 + NOTE_OFF×88 + TIME_SHIFT×100 + VELOCITY×32 + SUSTAIN_ON/OFF + 4 specials). Fine-tuned on the PiJAMA kong album-aware split. Best swept mean OA: 0.768 (only ≈0.04 below Aria real-time despite a ~150× parameter ratio). FMD: 438.6.
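The vocabulary size quoted above can be checked by hand, and the same arithmetic gives a rough parameter count for the 3-layer LSTM (hidden 512, embed 512, tied I/O head). The sketch below assumes the PyTorch nn.LSTM parameterisation (four gates, two bias vectors per layer) and no separate output bias; neither detail is stated in this card, and lstm_param_count is a hypothetical helper.

```python
# Token classes of the pedal-aware vocabulary, as listed above.
VOCAB_COMPONENTS = {
    "NOTE_ON": 88,
    "NOTE_OFF": 88,
    "TIME_SHIFT": 100,
    "VELOCITY": 32,
    "SUSTAIN_ON/OFF": 2,
    "SPECIAL": 4,
}
vocab_size = sum(VOCAB_COMPONENTS.values())


def lstm_param_count(vocab, embed=512, hidden=512, layers=3):
    """Estimate parameters of an embedding + stacked LSTM with a tied output head.

    Assumption: torch.nn.LSTM layout, i.e. per layer an input and a recurrent
    weight matrix covering 4 gates, plus two bias vectors of size 4 * hidden.
    """
    total = vocab * embed  # embedding matrix, reused as the output projection
    in_dim = embed
    for _ in range(layers):
        total += 4 * hidden * (in_dim + hidden) + 2 * 4 * hidden
        in_dim = hidden
    return total


total_params = lstm_param_count(vocab_size)
print(vocab_size, total_params)  # 314 6464512
```

With the 314-id vocabulary and these assumptions the estimate comes to roughly 6.46M parameters. The count depends on the vocabulary size only through the embedding matrix.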

tested.pt

Note: no full retrain on TRAIN+VAL+TEST was performed for the LSTMs. The checkpoint from the 4-stage generalisation-honest pipeline already gives a strong deployment baseline, and the models are cheap to train: if you need that variant, the training script in the code repository reproduces it in ≈25 minutes on a single NVIDIA B200 (or comparable GPU).

MLX variants for macOS inference

Each Aria model also has mlx-tested/ and mlx-deployed/ directories containing:

model.safetensors — same weights as the top-level safetensors, laid out for loading via mlx.core.load() on Apple silicon.
config.json — the corresponding Aria model config (medium.json for full-quality, medium-emb.json for real-time).
For aria-real-time/mlx-* only: tokenizer-config.json, the same 2 675-id demo tokenizer the upstream aria/demo/demo_mlx.py uses.
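After downloading the repository, the layout described in this card can be checked with a few lines of standard-library Python. This is a minimal sketch: expected_files and missing_files are hypothetical helpers, and the file list is only what this card names, so a real checkout may contain additional files.

```python
from pathlib import Path


def expected_files():
    """Map each model directory to the files this card says it contains."""
    files = {}
    for variant in ("aria-full-quality", "aria-real-time"):
        paths = ["tested.safetensors", "deployed.safetensors"]
        for mlx_dir in ("mlx-tested", "mlx-deployed"):
            paths += [f"{mlx_dir}/model.safetensors", f"{mlx_dir}/config.json"]
            if variant == "aria-real-time":
                # Only the real-time variant ships the demo tokenizer config.
                paths.append(f"{mlx_dir}/tokenizer-config.json")
        files[variant] = paths
    for variant in ("lstm-hawthorne", "lstm-kong-pedal"):
        files[variant] = ["tested.pt"]
    return files


def missing_files(repo_root):
    """Return the relative paths that are listed above but absent on disk."""
    root = Path(repo_root)
    return [
        f"{variant}/{rel}"
        for variant, rels in expected_files().items()
        for rel in rels
        if not (root / variant / rel).is_file()
    ]
```

Calling missing_files("/path/to/checkout") returns an empty list when every file named in this card is present.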

Running on macOS

aria-real-time/mlx-tested/ is a drop-in replacement for the weights expected by the upstream aria/demo/demo_mlx.py (iOS / Apple silicon real-time sampler from EleutherAI/aria). Point that script at model.safetensors and use the bundled tokenizer-config.json:

python aria/demo/demo_mlx.py \
    --checkpoint-path /path/to/mlx-tested/model.safetensors \
    --tokenizer-config /path/to/mlx-tested/tokenizer-config.json

aria-full-quality/mlx-*/ ships the full-quality weights and the medium.json config. The upstream demo_mlx.py hardcodes the medium-emb arch, so to run these checkpoints on MLX you either:

Adapt aria.inference.model_mlx.TransformerLM to load medium instead of medium-emb (drop the emb_proj layer), or
Run inference via PyTorch with the MPS backend on macOS, using the top-level tested.safetensors / deployed.safetensors and the default AbsTokenizer (no demo tokenizer config needed).
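Before choosing between medium and medium-emb, it can help to confirm which layout a given file actually has. The safetensors format starts with an 8-byte little-endian header length followed by a JSON header that lists every tensor, so the tensor names can be read without loading any weights. The sketch below uses only the standard library; the assumption that the projection layer's tensor names contain the substring "emb_proj" is a guess based on the layer name used in this card, not a confirmed key.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def has_embedding_projection(path, needle="emb_proj"):
    """Heuristic: True if any tensor name contains `needle` (assumed key substring)."""
    return any(needle in name for name in read_safetensors_header(path))
```

For example, has_embedding_projection("aria-real-time/mlx-tested/model.safetensors") would be expected to return True under that naming assumption, and False for the full-quality checkpoints.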

The full-quality checkpoints are ≈2.5 GB in bf16 — they fit easily on ≥16 GB unified-memory Apple silicon for inference.

Loading from Python

Aria (any variant) on CUDA / ROCm / MPS

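from aria.config import load_model_config
from aria.model import ModelConfig, TransformerLM
from safetensors.torch import load_file

# Pick the config that matches the checkpoint:
# "medium" for aria-full-quality, "medium-emb" for aria-real-time.
model_config = ModelConfig(**load_model_config("medium"))
model_config.set_vocab_size(17727)  # 2675 for aria-real-time
model = TransformerLM(model_config)

# strict=False tolerates key mismatches, so print what was skipped.
result = model.load_state_dict(load_file("tested.safetensors"), strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
model.eval()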

Aria on MLX (Apple silicon)

import mlx.core as mx
weights = mx.load("mlx-tested/model.safetensors")
# … then build the MLX TransformerLM as in aria.inference.model_mlx

LSTM

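import torch
from src.models.performancernn_lstm import PerformanceRNNLSTM, PerformanceRNNLSTMConfig

# weights_only=False lets torch.load unpickle arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("tested.pt", map_location="cpu", weights_only=False)
cfg = PerformanceRNNLSTMConfig(**ckpt["config"])  # config saved alongside the weights
model = PerformanceRNNLSTM(cfg)
model.load_state_dict(ckpt["model_state"], strict=True)
model.eval()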

(The PerformanceRNNLSTM / PerformanceRNNLSTMConfig definitions live in the code repository under src/models/performancernn_lstm.py.)

Recommended sampling settings

For each of the four models, the Stage-C sweep covered 12 cells: T ∈ {0.8, 1.0, 1.2} × top-k ∈ {0, 24} × min-p ∈ {0.035, 0.05}. The same cell (temperature = 1.2, top-k = 0, i.e. no truncation, min-p = 0.035) wins on both mean OA and FMD for every model in this repo. A wider post-training sweep extended temperature to 1.8 on a common kong reference and found that the OA optimum lies above 1.2 (the real-time model peaks at T = 1.4).

Model	best `(T, k, p)`	Mean OA ↑	FMD ↓ (CLaMP-2)
`aria-full-quality`	(1.2, 0, min-p 0.035)	0.911	272.6
`aria-real-time`	(1.2, 0, min-p 0.035)	0.804	233.6
`lstm-kong-pedal`	(1.2, 0, min-p 0.035)	0.768	438.6
`lstm-hawthorne`	(1.2, 0, min-p 0.035)	0.664	427.8
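The sweep grid and the table above are small enough to reproduce as data. The sketch below enumerates the 12 cells with itertools.product and ranks the models by each metric using the numbers from the table; the variable names are illustrative and not taken from the code repository.

```python
from itertools import product

# Stage-C sweep axes as described in this section.
TEMPERATURES = (0.8, 1.0, 1.2)
TOP_KS = (0, 24)  # 0 means no top-k truncation
MIN_PS = (0.035, 0.05)

grid = list(product(TEMPERATURES, TOP_KS, MIN_PS))
BEST_CELL = (1.2, 0, 0.035)  # winning (T, top-k, min-p) for every model

# (mean OA, FMD) at the best cell, copied from the table above.
RESULTS = {
    "aria-full-quality": (0.911, 272.6),
    "aria-real-time": (0.804, 233.6),
    "lstm-kong-pedal": (0.768, 438.6),
    "lstm-hawthorne": (0.664, 427.8),
}

best_by_oa = max(RESULTS, key=lambda name: RESULTS[name][0])   # higher is better
best_by_fmd = min(RESULTS, key=lambda name: RESULTS[name][1])  # lower is better

print(len(grid), best_by_oa, best_by_fmd)
```

The two metrics do not agree on a single best model: aria-full-quality has the highest mean OA, while aria-real-time has the lowest FMD.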

Three robust observations from the sweep:

Temperature dominates. Bumping T from 0.8 → 1.2 buys +0.18–0.30 absolute OA on Aria at every (k, p) cell and +0.28 on both LSTM splits.
Don't truncate. top-k = 0 (no truncation) beats top-k = 24 by 0.03–0.07 OA at every (T, p) cell.
min-p is comparatively flat between 0.035 and 0.05; the smaller value wins by a small margin everywhere.
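To make the three knobs concrete, here is a minimal pure-Python sketch of temperature, top-k and min-p sampling over a single logit vector. It is not the sampler used in the code repository: the function names are hypothetical, and the order of operations (temperature first, then top-k, then min-p relative to the most probable token) is an assumption, since implementations differ on this point.

```python
import math
import random


def filtered_distribution(logits, temperature=1.2, top_k=0, min_p=0.035):
    """Return {token_id: probability} after temperature, top-k and min-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    keep = list(range(len(probs)))
    if top_k > 0:  # top_k == 0 disables truncation
        keep = sorted(keep, key=lambda i: probs[i], reverse=True)[:top_k]

    # min-p: drop tokens whose probability is below min_p times the top probability.
    threshold = min_p * max(probs)
    keep = [i for i in keep if probs[i] >= threshold]

    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def sample_token(logits, temperature=1.2, top_k=0, min_p=0.035, rng=random):
    """Draw one token id from the filtered, renormalised distribution."""
    dist = filtered_distribution(logits, temperature, top_k, min_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

The defaults mirror the recommended cell (T = 1.2, top-k = 0, min-p = 0.035). Because the most probable token always passes the min-p test, the filtered set is never empty.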

Reproducibility

All checkpoints were produced by the pipeline scripts in the code repository (scripts/aria_pipeline_per_variant.sh for the Aria variants, scripts/train_performancernn_lstm_pipeline.sh for the LSTMs). Reported metrics come from src/eval_aria_metrics.py (OA / KLD) and scripts/fmd_eval_sweeps.py (FMD with the CLaMP-2 music encoder).
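For orientation when reading the FMD numbers: Fréchet-style distances are conventionally computed between two Gaussians fitted to encoder embeddings of a reference set and a generated set. Assuming scripts/fmd_eval_sweeps.py follows this standard definition (the card does not spell it out), with mean and covariance $(\mu_r, \Sigma_r)$ for the reference embeddings and $(\mu_g, \Sigma_g)$ for the generated ones:

```latex
\[
\mathrm{FMD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Lower values mean the generated embeddings are statistically closer to the reference set, which is why the table marks FMD with a downward arrow.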

License and attribution

This repository is released under CC-BY-NC-SA-4.0 — non-commercial, with attribution and share-alike — the most restrictive of the licenses of the training data and base models. Per-model provenance:

LSTM checkpoints (lstm-hawthorne, lstm-kong-pedal) were pretrained directly on Aria-MIDI (CC-BY-NC-SA-4.0) and fine-tuned on PiJAMA (CC-BY-NC) → CC-BY-NC-SA-4.0.
Aria fine-tunes (aria-full-quality, aria-real-time) start from the public Aria weights (Apache-2.0) and are fine-tuned on PiJAMA (CC-BY-NC) → effectively CC-BY-NC-4.0; the repository is tagged at the stricter license above.

Use is restricted to non-commercial research. PiJAMA and Aria-MIDI are automatic transcriptions of copyrighted recordings, and the underlying works remain under their original copyright. Please attribute and cite PiJAMA (Edwards et al. 2024), Aria-MIDI (Bradshaw & Colton 2025) and Aria (EleutherAI, Apache-2.0).

Citation

If you use these checkpoints, please cite the PiJAMA, Aria and PerformanceRNN papers:

Edwards, Dixon and Benetos. PiJAMA: Piano Jazz with Automatic MIDI Annotations. Transactions of the International Society for Music Information Retrieval, 6(1):89–102, 2024.
Bradshaw and Colton. Scaling self-supervised representation learning for symbolic piano performance. arXiv:2506.23869, 2025.
Oore, Simon, Dieleman, Eck and Simonyan. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications, 32:955–967, 2020.
