MULE (PyTorch) — matteospanio/mule
Pretrained weights for an unofficial PyTorch port of MULE (Musicset Unsupervised Large Embedding), the SF-NFNet-F0 music-audio representation model from SiriusXM/Pandora:
Supervised and Unsupervised Learning of Audio Representations for Music Understanding, M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, A. F. Ehmann. ISMIR 2022. https://arxiv.org/abs/2210.03799
These weights were converted, not re-trained — transferred from the original
TensorFlow/Keras model.keras into the PyTorch implementation and verified to be
numerically equivalent (end-to-end clip-embedding cosine 0.9999999 vs the
original pipeline; ONNX backbone max-abs < 1e-6; 62.35 M params).
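The two equivalence metrics quoted above (cosine similarity between embeddings and maximum absolute difference between outputs) can be sketched as follows. This is a minimal illustration on stand-in data, not the script used for the actual verification; the helper names `cosine_similarity` and `max_abs_diff` are hypothetical and are not part of mule-torch.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (flattened to 1-D)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))


# Stand-in data: a random 1728-d "reference" embedding and a slightly
# perturbed copy playing the role of the ported model's output.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1728)
ported = reference + 1e-7 * rng.standard_normal(1728)

cos = cosine_similarity(reference, ported)
diff = max_abs_diff(reference, ported)
print(f"cosine={cos:.9f}  max-abs={diff:.2e}")
```

To compare real outputs, replace the stand-in arrays with an embedding from the original pipeline and the corresponding one from this port.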
Library / code: https://github.com/matteospanio/mule-torch
⚠️ Unofficial. This is an independent community port from TensorFlow to PyTorch. It is not affiliated with, endorsed by, or maintained by SiriusXM, Pandora, or the original authors. All credit for the model goes to them.
Files
| File | What |
|---|---|
| model.safetensors | Full model state dict (SF-NFNet-F0 backbone + mel filterbank buffer), ~267 MB. |
| config.json | Architecture + frontend + slicing constants (rebuilds MuleConfig). |
| backbone.onnx | Self-contained ONNX export of the backbone ((N,1,96,300) log-mel slice → (N,1728)), opset 17, dynamic batch. ~252 MB. |
Usage
pip install mule-torch # or: pip install git+https://github.com/matteospanio/mule-torch
import torch
from mule_torch import MuleModel
# Downloads these weights from the Hub by default.
model = MuleModel.from_pretrained() # == from_pretrained(hf_repo="matteospanio/mule")
waveform = torch.randn(1, 16000 * 10) # (B, T) mono @ 16 kHz, in [-1, 1]
emb = model(waveform) # (B, 1728)
ONNX (backbone only)
The full waveform→embedding path includes a data-dependent number of 2-second
slices, so the ONNX export covers the backbone (one standardized 96×300
log-mel slice → 1728-d). Do the mel front-end + slicing in torch/host, then run
slices through backbone.onnx:
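import numpy as np
import onnxruntime as ort
# `slices` is an (N, 1, 96, 300) array of standardized log-mel slices prepared beforehand in torch/host code.
sess = ort.InferenceSession("backbone.onnx", providers=["CPUExecutionProvider"])
emb = sess.run(None, {"mel_slice": slices.astype(np.float32)})[0] # (N, 1728) per-slice embeddings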
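Since the backbone returns one embedding per slice, a clip-level vector still has to be pooled on the host. The sketch below checks the input shape and mean-pools the per-slice outputs, matching the pooling described under Input convention. The helper `embed_clip` and the stand-in `fake_backbone` are hypothetical names introduced for illustration; the backbone is passed in as a callable so the example runs without the ONNX file.

```python
import numpy as np

EMBED_DIM = 1728            # backbone output size stated in this card
SLICE_SHAPE = (1, 96, 300)  # (channel, mel bands, frames) per slice


def embed_clip(slices, run_backbone):
    """Run a backbone callable over (N, 1, 96, 300) slices and mean-pool to (1728,)."""
    slices = np.asarray(slices, dtype=np.float32)
    if slices.ndim != 4 or slices.shape[1:] != SLICE_SHAPE:
        raise ValueError(f"expected (N, 1, 96, 300), got {slices.shape}")
    if slices.shape[0] == 0:
        raise ValueError("need at least one slice to pool")
    per_slice = np.asarray(run_backbone(slices))
    if per_slice.shape != (slices.shape[0], EMBED_DIM):
        raise ValueError(f"unexpected backbone output shape {per_slice.shape}")
    return per_slice.mean(axis=0)  # one vector per clip


def fake_backbone(x):
    """Stand-in for the real backbone (NOT the model): broadcasts each slice's mean."""
    return np.repeat(x.mean(axis=(1, 2, 3))[:, None], EMBED_DIM, axis=1)


slices = np.zeros((4, 1, 96, 300), dtype=np.float32)
clip_emb = embed_clip(slices, fake_backbone)
print(clip_emb.shape)
```

With onnxruntime, the same helper can be driven by the session from the snippet above, for example `embed_clip(slices, lambda x: sess.run(None, {"mel_slice": x})[0])`.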
Input convention
16 kHz mono waveform in [-1, 1]. The model computes a 96-band log-mel
spectrogram, slices it into 96×300 windows every ~2 s, runs the backbone, and
mean-pools the per-slice 1728-d embeddings into one vector per clip.
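The slicing step can be sketched as below, assuming a log-mel spectrogram of shape (96, T) is already available. The hop of 200 frames is an assumption made for illustration (it corresponds to ~2 s only if frames are 10 ms apart); the authoritative constants are the ones in config.json. Dropping a trailing partial window and returning no slices for short clips are also choices of this sketch, not confirmed behaviour of the library, and `slice_log_mel` is a hypothetical helper name.

```python
import numpy as np

N_MELS = 96
SLICE_FRAMES = 300
HOP_FRAMES = 200  # ASSUMPTION: ~2 s hop at a 10 ms frame rate; check config.json


def slice_log_mel(log_mel, hop_frames=HOP_FRAMES):
    """Cut a (96, T) log-mel spectrogram into (N, 1, 96, 300) windows."""
    log_mel = np.asarray(log_mel, dtype=np.float32)
    if log_mel.ndim != 2 or log_mel.shape[0] != N_MELS:
        raise ValueError(f"expected (96, T), got {log_mel.shape}")
    n_frames = log_mel.shape[1]
    if n_frames < SLICE_FRAMES:
        # Sketch choice: too short for a full window, so return no slices.
        return np.empty((0, 1, N_MELS, SLICE_FRAMES), dtype=np.float32)
    # Window start indices; a trailing partial window is dropped.
    starts = range(0, n_frames - SLICE_FRAMES + 1, hop_frames)
    windows = [log_mel[:, s:s + SLICE_FRAMES] for s in starts]
    return np.stack(windows)[:, None, :, :]  # add the channel axis


log_mel = np.random.default_rng(0).standard_normal((96, 1000)).astype(np.float32)
mel_slices = slice_log_mel(log_mel)
print(mel_slices.shape)
```

The resulting array has the (N, 1, 96, 300) layout that backbone.onnx expects; any standardization applied to each slice is not shown here.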
The original AudioFile reader scales PCM16 by 1/2^16; conventional [-1, 1] audio tracks the original closely but isn't bit-identical (the log10(10000·x+1) mel compression is non-linear).
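A small numeric sketch of why the two scalings cannot match exactly: dividing PCM16 by 2^15 instead of 2^16 doubles the waveform, and under log10(10000·x+1) a constant gain does not become a constant offset. The mel values and the gain of 2.0 below are hypothetical; the real factor depends on whether the front-end compresses magnitudes or powers, which this card does not state.

```python
import numpy as np


def compress(x):
    """Mel compression quoted in this card: log10(10000 * x + 1)."""
    return np.log10(10000.0 * x + 1.0)


pcm16 = np.array([-32768, -1000, 0, 1000, 32767], dtype=np.int16)
original_scale = pcm16.astype(np.float32) / 2**16  # scaling attributed to the original reader
conventional = pcm16.astype(np.float32) / 2**15    # usual [-1, 1) convention

# The conventional waveform is exactly twice the original-scale one.
print(np.allclose(conventional, 2.0 * original_scale))

# Hypothetical mel values spanning quiet to loud bins.
mel = np.array([1e-6, 1e-4, 1e-2, 1.0])
gain = 2.0  # ASSUMPTION: illustrative gain applied to the mel values
offset = compress(gain * mel) - compress(mel)
print(offset)  # grows from near 0 (quiet bins) towards log10(2) (loud bins)
```

Because the offset varies with level, no single constant added after compression can undo the scaling difference.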
License
These weights are a derivative of the original MULE weights, released by
Pandora/SiriusXM under CC BY-NC 4.0, and inherit that non-commercial
license. The mule-torch source code is GPL-3.0-only. Please cite McCallum et
al. (2022).
@inproceedings{mccallum2022mule,
title = {Supervised and Unsupervised Learning of Audio Representations for Music Understanding},
author = {McCallum, Matthew C. and Korzeniowski, Filip and Oramas, Sergio and Gouyon, Fabien and Ehmann, Andreas F.},
booktitle = {Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR)},
year = {2022},
url = {https://arxiv.org/abs/2210.03799}
}