Magenta RealTime 2 (ONNX)

ONNX export of google/magenta-realtime-2 (MRT2), an open-weights real-time music-generation model from Google DeepMind. This repo re-packages the MRT2 components as ONNX graphs that run with onnxruntime on CPU, CUDA GPU, and the web (onnxruntime-web or jax-js), with no JAX, TensorFlow, or Apple MLX runtime required.

Unofficial community export by @blanchon. All model weights © Google LLC, redistributed under CC-BY-4.0 (see License & terms below). Converted from the original google/magenta-realtime-2 artifacts.

Components

MRT2 is three models chained together:

| Component | Role | This repo |
| --- | --- | --- |
| MusicCoCa | text / audio → 768-d style embedding → 12 RVQ style tokens | musiccoca/*.onnx |
| SpectroStream | 48 kHz stereo audio codec (encode ↔ 12-RVQ tokens ↔ decode) | spectrostream/*.onnx |
| Depthformer LLM | autoregressive frame-wise generator of audio tokens (style + MIDI + context → tokens) | mrt2_small/onnx/ (230M, fp32) |

The full text-to-music pipeline is: MusicCoCa(prompt) → style tokens → Depthformer generates SpectroStream tokens frame-by-frame (25 Hz) → SpectroStream decoder → 48 kHz stereo audio.
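
The rates quoted above fix how many frames, codes, and PCM samples a clip of a given length needs. The following is a minimal arithmetic sketch using only numbers stated in this card (25 Hz frames, 12 codes per frame, 48 kHz stereo). It assumes the decoder emits exactly one frame's worth of samples per token frame; the constant and function names are illustrative and not part of the repo.

```python
# Illustrative arithmetic only; these names are not part of the repo.
FRAME_RATE_HZ = 25        # Depthformer frames per second (from this card)
RVQ_LEVELS = 12           # codes per frame (from this card)
SAMPLE_RATE_HZ = 48_000   # SpectroStream output rate (from this card)
CHANNELS = 2              # stereo

# ASSUMPTION: each token frame decodes to exactly this many samples per channel.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ
CODES_PER_SECOND = FRAME_RATE_HZ * RVQ_LEVELS


def budget(seconds: float) -> dict:
    """Frames, codes and PCM samples needed for `seconds` of audio."""
    frames = round(seconds * FRAME_RATE_HZ)
    return {
        "frames": frames,
        "codes": frames * RVQ_LEVELS,
        "samples_per_channel": frames * SAMPLES_PER_FRAME,
        "total_samples": frames * SAMPLES_PER_FRAME * CHANNELS,
    }


if __name__ == "__main__":
    print(SAMPLES_PER_FRAME, CODES_PER_SECOND)
    print(budget(10.0))
```

Under these assumptions a frame covers 1920 samples per channel and the generator must emit 300 codes per second to keep up with real time.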

Repository layout

musiccoca/                 # style model (5 ONNX graphs + SentencePiece)
  text_encoder.onnx  audio_preprocessor.onnx  music_encoder.onnx
  pretrained_vector_quantizer.onnx  mapper.onnx  spm.model
spectrostream/             # audio codec
  encoder.onnx  decoder.onnx
mrt2_small/onnx/           # 230M Depthformer LLM, self-contained fp32 graphs
  encoder.onnx  temporal_step.onnx  depth_step.onnx  embed.onnx

Every .onnx here is a single self-contained file (no external .onnx.data).
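
After downloading the repo, the layout above can be checked mechanically. The sketch below uses only the file names from the listing; the `check_layout` helper is hypothetical and not shipped with the repo. It reports files that are missing and any `.onnx.data` sidecars, which should not exist if every graph is self-contained.

```python
from pathlib import Path

# File names copied from the layout listing above.
EXPECTED = {
    "musiccoca": [
        "text_encoder.onnx", "audio_preprocessor.onnx", "music_encoder.onnx",
        "pretrained_vector_quantizer.onnx", "mapper.onnx", "spm.model",
    ],
    "spectrostream": ["encoder.onnx", "decoder.onnx"],
    "mrt2_small/onnx": [
        "encoder.onnx", "temporal_step.onnx", "depth_step.onnx", "embed.onnx",
    ],
}


def check_layout(root):
    """Return (missing files, external-data sidecars found) under `root`."""
    root = Path(root)
    missing = [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]
    # Self-contained graphs should have no *.onnx.data companions.
    sidecars = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.onnx.data"))
    return missing, sidecars


if __name__ == "__main__":
    missing, sidecars = check_layout(".")
    print("missing:", missing)
    print("external data files:", sidecars)
```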

How the LLM is structured

The Depthformer is exported as four self-contained ONNX graphs (encoder.onnx, temporal_step.onnx, depth_step.onnx, embed.onnx), driven by a thin host-side runtime loop that carries the fixed-size windowed KV cache between frames and performs sampling. For each 25 Hz frame the temporal step runs once, then the depth step and embed graphs run once per RVQ level (12 times) to emit the 12 codes for that frame. A complete, readable reference implementation of this loop is src/lib/mrt2.ts in the demo Space (jax-js).
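
The control flow of that loop can be sketched independently of the graphs. In the sketch below, `temporal_step`, `depth_step`, and `embed` are hypothetical callables standing in for wrappers around the corresponding ONNX sessions; their argument lists are assumptions made for illustration, and the real feeds are defined by the graphs and by src/lib/mrt2.ts. The temperature sampler is likewise an assumed choice, not necessarily what the demo uses.

```python
import math
import random

RVQ_LEVELS = 12  # codes per frame, from the description above


def sample(logits, temperature=1.0, rng=random):
    """Host-side sampling: softmax with temperature, then a single draw."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((x - peak) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(n_frames, temporal_step, depth_step, embed, kv_cache, rng=random):
    """Schematic frame loop.

    The three callables are HYPOTHETICAL wrappers around temporal_step.onnx,
    depth_step.onnx and embed.onnx; their signatures are assumptions.
    """
    frames = []
    for _ in range(n_frames):
        # One temporal step per frame; it returns the updated windowed cache.
        hidden, kv_cache = temporal_step(frames, kv_cache)
        codes, prev = [], None
        for level in range(RVQ_LEVELS):
            # One depth step plus one embed call per RVQ level.
            logits = depth_step(hidden, prev, level)
            code = sample(logits, rng=rng)
            prev = embed(code, level)
            codes.append(code)
        frames.append(codes)
    return frames, kv_cache


if __name__ == "__main__":
    # Stub callables so the sketch runs without any model files.
    frames, _ = generate(
        n_frames=2,
        temporal_step=lambda frames, cache: (0.0, cache),
        depth_step=lambda hidden, prev, level: [0.0, 0.5, 1.0, 0.5],
        embed=lambda code, level: code,
        kv_cache=None,
        rng=random.Random(0),
    )
    print(len(frames), "frames of", len(frames[0]), "codes")
```

Per second of audio this amounts to 25 temporal-step calls and 300 depth-step and embed calls, which is why the depth graphs are kept small and the cache is passed explicitly rather than rebuilt.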

Quick start (Python, onnxruntime)

import numpy as np
import onnxruntime as ort

# Each graph is provider-agnostic: CPU here, CUDA or web elsewhere.
sess = ort.InferenceSession(
    "musiccoca/pretrained_vector_quantizer.onnx",
    providers=["CPUExecutionProvider"],  # ["CUDAExecutionProvider", ...] for GPU
)
emb = np.zeros((1, 768), np.float32)  # placeholder 768-d style embedding
tokens = sess.run(None, {sess.get_inputs()[0].name: emb})[0]  # 12 RVQ style tokens
print(tokens)

The four LLM graphs and the SpectroStream decoder are driven exactly as in the runtime loop described above (see the demo Space for the canonical sequence of feeds).
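
This card does not list the input and output names of each graph, so a reasonable first step before wiring up the loop is to print them. The sketch below relies on `InferenceSession.get_inputs()` and `get_outputs()`, whose entries expose `name`, `type`, and `shape`; the `describe_io` helper itself is hypothetical and not part of the repo.

```python
def describe_io(sess):
    """List (kind, name, type, shape) for every input and output of a session.

    Intended for an onnxruntime.InferenceSession, whose get_inputs() and
    get_outputs() return objects exposing .name, .type and .shape.
    """
    rows = []
    for kind, args in (("input", sess.get_inputs()), ("output", sess.get_outputs())):
        for arg in args:
            rows.append((kind, arg.name, arg.type, list(arg.shape)))
    return rows


def main(paths):
    # Imported here so describe_io stays importable without onnxruntime.
    import onnxruntime as ort

    for path in paths:
        sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
        print(path)
        for row in describe_io(sess):
            print("  ", *row)


if __name__ == "__main__":
    import sys

    # Pass one or more graph paths, e.g. mrt2_small/onnx/temporal_step.onnx
    if sys.argv[1:]:
        main(sys.argv[1:])
```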

Quick start (Web)

A live in-browser demo (with full source) is the 🎵 demo Space: the Depthformer LLM runs natively in jax-js (WebGPU), and the SpectroStream decoder runs in onnxruntime-web (WASM) because it uses ops that jax-js doesn't implement. Pick a prompt, generate, and play the result.

To load a single graph directly with onnxruntime-web:

<script type="module">
import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.all.min.mjs";
const sess = await ort.InferenceSession.create("musiccoca/pretrained_vector_quantizer.onnx",
                                               { executionProviders: ["webgpu", "wasm"] });
const emb = new ort.Tensor("float32", new Float32Array(768), [1, 768]);
const out = await sess.run({ [sess.inputNames[0]]: emb });
console.log(out[sess.outputNames[0]].data);   // 12 RVQ style tokens
</script>

Validation

All graphs were validated against the original models. Discrete outputs (RVQ tokens) match exactly; continuous outputs match within fp32 tolerance:

| Component | vs original |
| --- | --- |
| MusicCoCa (text → style tokens) | token-exact |
| SpectroStream codec | codes exact, decode ≤ 9e-5 |
| Depthformer LLM (small, fp32) | codes exact (PyTorch & ONNX, in-browser) |
| Full pipeline (prompt → audio) | codes exact vs JAX fp32 |
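
The two kinds of check in the table can be sketched as follows. This is not the validation script used for the export; it assumes that "decode ≤ 9e-5" refers to the maximum absolute sample difference, which the card does not state explicitly. Plain Python sequences keep the sketch dependency-free; with NumPy the same checks are `np.array_equal(ref, got)` and `np.max(np.abs(ref - got))`.

```python
DECODE_TOL = 9e-5  # decode tolerance quoted in the table above


def codes_exact(ref, got):
    """Discrete check: every RVQ code must be identical."""
    return list(ref) == list(got)


def max_abs_err(ref, got):
    """Continuous check: largest absolute difference between two signals."""
    if len(ref) != len(got):
        raise ValueError("length mismatch")
    return max((abs(a - b) for a, b in zip(ref, got)), default=0.0)


if __name__ == "__main__":
    # Toy values for illustration only, not measurements from this export.
    ref_codes, onnx_codes = [3, 141, 59, 26], [3, 141, 59, 26]
    ref_audio, onnx_audio = [0.10, -0.25, 0.5], [0.10002, -0.25001, 0.49996]
    print("codes exact:", codes_exact(ref_codes, onnx_codes))
    print("within tolerance:", max_abs_err(ref_audio, onnx_audio) <= DECODE_TOL)
```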

How this was exported

  • MusicCoCa: TFLite → ONNX via a patched tf2onnx (added handlers for FULLY_CONNECTED keep_num_dims, GELU, and EMBEDDING_LOOKUP). The log-mel STFT's RFFT/ComplexAbs island was replaced with an equivalent DFT cos/sin matmul so that it uses only ONNX-standard, web-compatible ops.
  • SpectroStream & Depthformer: the original sequence-layers/JAX graphs do not lower cleanly through jax2tf (opaque XlaCallModule) or jax2onnx (shape-tracing limits), so these were reimplemented in PyTorch from the architecture and checkpoint weights, validated numerically against the JAX reference, and exported with torch.onnx.
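
The RFFT/ComplexAbs replacement mentioned for MusicCoCa rests on a simple identity: the magnitude of each real-FFT bin equals sqrt((x·cos)² + (x·sin)²), where the cos and sin bases are fixed matrices. The sketch below demonstrates that identity on a toy frame in pure Python. It is not the exporter's code, and the frame length is arbitrary rather than the model's actual STFT size.

```python
import cmath
import math


def dft_bases(n):
    """cos/sin matrices of shape [n][n // 2 + 1], one column per RFFT bin."""
    bins = n // 2 + 1
    cos_m = [[math.cos(2 * math.pi * k * t / n) for k in range(bins)] for t in range(n)]
    sin_m = [[math.sin(2 * math.pi * k * t / n) for k in range(bins)] for t in range(n)]
    return cos_m, sin_m


def matvec(frame, basis):
    """frame (length n) times basis (n x bins): a plain matmul."""
    return [sum(x * row[k] for x, row in zip(frame, basis)) for k in range(len(basis[0]))]


def magnitude_via_matmul(frame):
    """|RFFT(frame)| using only matmul, multiply, add and sqrt."""
    cos_m, sin_m = dft_bases(len(frame))
    re = matvec(frame, cos_m)
    im = matvec(frame, sin_m)  # the sign of the imaginary part does not affect |.|
    return [math.sqrt(r * r + i * i) for r, i in zip(re, im)]


def magnitude_reference(frame):
    """Direct complex DFT magnitude for the non-negative frequency bins."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


if __name__ == "__main__":
    frame = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.75, 0.1]  # toy 8-sample frame
    a, b = magnitude_via_matmul(frame), magnitude_reference(frame)
    print(max(abs(x - y) for x, y in zip(a, b)))  # difference between the two methods
```

Because the bases depend only on the frame length, they can be baked into the graph as constants, leaving nothing but standard dense ops at inference time.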

License & terms

Magenta RealTime 2 is released by Google under a combination of licenses: the codebase under Apache 2.0 and the model weights under CC-BY-4.0. The ONNX artifacts here are derived from those weights and are redistributed under CC-BY-4.0, with attribution to Google LLC / Google DeepMind. Google's MRT2 usage terms apply (Copyright 2026 Google LLC; use responsibly, do not generate infringing content; Google claims no rights in your outputs; "AS IS", no warranty). See the original model card.

Citation

@inproceedings{gdmlyria2025live,
  title={Live Music Models},
  author={Caillon, Antoine and others},
  booktitle={NeurIPS Creative AI}, year={2025}
}

Original: https://huggingface.co/google/magenta-realtime-2 · https://github.com/magenta/magenta-realtime
