HT-Demucs (single-file 4-stem) - ONNX
The first ONNX export of the standard htdemucs (non-FT) model on
the Hugging Face Hub. Runs in onnxruntime on CPU out of the box, and
on CoreML / CUDA / DirectML with a one-line provider change.
No PyTorch required at inference.
This repo is the single-file companion to
StemSplitio/htdemucs-ft-onnx.
You get all 4 stems out of one 316 MB .onnx file (htdemucs.onnx),
or 166 MB if you grab the fp16weights variant. The FT bag is higher
quality; this single model has ~30% lower latency and uses 1 session instead of 4.
TL;DR
# 316 MB fp32 model:
pip install onnxruntime numpy soundfile
python infer.py your-song.mp3 ./out/ --write-all-stems
# writes ./out/{drums,bass,other,vocals}.wav at 44.1 kHz stereo
# 166 MB fp16weights variant (same runtime cost):
python infer.py your-song.mp3 ./out/ --small --write-all-stems
The repo contains:
- htdemucs.onnx: 316 MB, opset 17, parity-verified vs PyTorch fp32.
- htdemucs_fp16weights.onnx: 166 MB, fp16-stored weights, same runtime memory / latency.
- infer.py: pure-numpy reference inference (~200 lines, no torch).
- requirements.txt: three small packages, no PyTorch.
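To illustrate why fp16-stored weights halve the file size without changing runtime memory, here is a minimal numpy sketch. It assumes (this repo does not spell it out) that the stored half-precision weights are cast back to float32 before compute; the array shape is a made-up stand-in, not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one layer's fp32 weights (hypothetical shape, not from the model).
w_fp32 = rng.standard_normal((256, 256)).astype(np.float32)

# On disk: half-precision storage takes half the bytes.
w_stored = w_fp32.astype(np.float16)

# At runtime (assumed): upcast to fp32, so the in-memory footprint matches fp32.
w_runtime = w_stored.astype(np.float32)

storage_ratio = w_stored.nbytes / w_fp32.nbytes
max_rounding_error = float(np.max(np.abs(w_runtime - w_fp32)))
```

The only thing lost is the fp16 rounding of each weight, which is small relative to the weight values themselves.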
Quality
The official htdemucs model is the precursor to htdemucs_ft: same
architecture, single set of weights instead of 4 specialist sub-models.
On MUSDB18-HQ:
| Metric | htdemucs (this) | htdemucs_ft (4-bag) |
|---|---|---|
| Median vocals SDR | ~8.8 dB | 9.19 dB |
| Median drums SDR | ~9.5 dB | 10.11 dB |
| Total model size | 316 MB | 1.26 GB |
| Sessions to load | 1 | 4 |
| Speed vs the bag | ~1.4× faster | baseline |
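For readers unfamiliar with the SDR metric in the table, the following sketch shows its basic definition, the ratio of reference energy to error energy in dB. It is a simplification: benchmark scores such as the ones above are normally produced by dedicated evaluation tooling that works on windows and takes medians, which this sketch does not attempt to reproduce. The function name `sdr_db` is illustrative only.

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Signal-to-distortion ratio in dB: 10*log10(||ref||^2 / ||ref - est||^2)."""
    signal = float(np.sum(reference ** 2))
    distortion = float(np.sum((reference - estimate) ** 2))
    return float(10.0 * np.log10((signal + eps) / (distortion + eps)))

# Toy example: the estimate is the reference scaled by 0.9,
# so the error has 10% of the reference amplitude.
t = np.linspace(0.0, 1.0, 44100, endpoint=False)
reference = np.sin(2 * np.pi * 440.0 * t)
estimate = 0.9 * reference
toy_sdr = sdr_db(reference, estimate)
```

An error at 10% of the signal amplitude corresponds to 20 dB, which gives a feel for the scale: the sub-1 dB gaps in the table are small but audible differences in residual bleed.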
Parity vs PyTorch fp32 (random input, 7.8 s segment):
- htdemucs.onnx max abs diff: 6.62 × 10⁻⁴
- htdemucs_fp16weights.onnx max abs diff (vs fp32 weights): 4.6 × 10⁻⁵
Both well within the 1e-3 publish threshold.
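The parity figures above are maximum absolute differences between two output tensors. A rough sketch of such a check follows, using synthetic stand-in arrays in place of real PyTorch and ONNX outputs; the helper name `max_abs_diff` and the test data are assumptions for illustration, not the script used to verify this repo.

```python
import numpy as np

PUBLISH_THRESHOLD = 1e-3  # the threshold quoted above

def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shaped outputs."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    delta = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(delta)))

# Stand-ins: in a real check these would be the PyTorch and ONNX outputs
# for the same random input segment.
rng = np.random.default_rng(0)
torch_out = rng.standard_normal((1, 4, 2, 1024)).astype(np.float32)
noise = rng.uniform(-1.0, 1.0, torch_out.shape).astype(np.float32)
onnx_out = torch_out + np.float32(5e-4) * noise

diff = max_abs_diff(torch_out, onnx_out)
passes = diff < PUBLISH_THRESHOLD
```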
Performance
Single 7.8 s segment, Apple M4 Pro CPU:
| Variant | RAM | Latency | RTF |
|---|---|---|---|
| htdemucs.onnx (fp32) | ~1.1 GB | ~1.6 s | 0.20 |
| htdemucs_fp16weights.onnx | ~1.1 GB | ~1.6 s | 0.20 |
| For comparison: htdemucs_ft (4-session bag) | ~4.0 GB | ~6.4 s | 0.49 |
CUDA / DirectML / CoreML EPs are typically ≥ 5× faster on real GPUs.
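The RTF column is the real-time factor: processing time divided by audio duration, so values below 1 mean faster than real time. A small sketch of the arithmetic for the single-file model, where the 3-minute song estimate is a back-of-the-envelope assumption that ignores chunk overlap and file I/O:

```python
SEGMENT_SECONDS = 7.8  # segment length used in the table above

def real_time_factor(latency_s: float, audio_s: float = SEGMENT_SECONDS) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return latency_s / audio_s

# ~1.6 s to process one 7.8 s segment gives an RTF of roughly 0.2.
rtf_single = real_time_factor(1.6)

# Rough estimate for a 3-minute song, ignoring overlap between chunks.
estimated_seconds = rtf_single * 180.0
```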
Quick start
Python
import soundfile as sf
import infer
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = infer.separate(audio.T, sr,
                       model_path=infer.DEFAULT_MODEL,
                       providers=["CPUExecutionProvider"])
for stem, arr in stems.items():
    sf.write(f"{stem}.wav", arr.T, sr)
CLI
python infer.py your-song.mp3 ./out/ --write-all-stems
python infer.py your-song.mp3 ./out/ --providers coreml # macOS arm64
python infer.py your-song.mp3 ./out/ --providers cuda # Linux + NVIDIA
python infer.py your-song.mp3 ./out/ --providers dml # Windows + DX12
python infer.py your-song.mp3 ./out/ --small # 166 MB variant
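How infer.py maps these short provider names internally is not shown here, so the following is only a sketch of one way to do it. The right-hand strings are onnxruntime's execution provider identifiers; the helper `resolve_providers` and its CPU fallback behaviour are assumptions for illustration.

```python
# Short names as used by the CLI flags above, mapped to onnxruntime
# execution provider identifiers.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
}

def resolve_providers(alias, available):
    """Return a provider list for onnxruntime, falling back to CPU if needed."""
    if alias not in PROVIDER_ALIASES:
        raise ValueError(f"unknown provider alias: {alias!r}")
    wanted = PROVIDER_ALIASES[alias]
    providers = [wanted] if wanted in available else []
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers

# In real use, `available` would come from onnxruntime.get_available_providers().
available = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
chosen = resolve_providers("coreml", available)
fallback = resolve_providers("cuda", available)
```

Keeping CPU at the end of the list means a session can still be created on machines where the accelerated provider is not installed.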
Mobile / Web (after adding the onnxruntime-objc CocoaPods pod, or running npm install onnxruntime-web)
// iOS / Swift
import onnxruntime_objc
// Create the runtime environment the session is attached to.
let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
let session = try ORTSession(env: env,
                             modelPath: Bundle.main.path(forResource: "htdemucs", ofType: "onnx")!,
                             sessionOptions: opts)
// Browser / web
import * as ort from "onnxruntime-web";
const sess = await ort.InferenceSession.create("htdemucs_fp16weights.onnx", {
  executionProviders: ["wasm"],
});
const t = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await sess.run({ mix: t }); // out.stems is (1, 4, 2, 343980)
For a turnkey browser demo with file-picker + chunked overlap-add, see
demucs-onnx browser-demo.
Input / output spec
| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | mix | (1, 2, 343980) | float32 | Stereo, 44.1 kHz, 7.8 s segment. Values in [-1, 1]. |
| Output | stems | (1, 4, 2, 343980) | float32 | Stems in order [drums, bass, other, vocals]. All 4 are real predictions (unlike the FT specialists). |
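The shapes in the table can be handled with a couple of small helpers. This sketch pads a short stereo clip to the fixed segment length and splits an output tensor into named stems; zero-padding is an assumption here (infer.py may pad differently), and `prepare_segment` / `unpack_stems` are hypothetical names.

```python
import numpy as np

SEGMENT_SAMPLES = 343980  # 7.8 s at 44.1 kHz
STEM_NAMES = ["drums", "bass", "other", "vocals"]

def prepare_segment(audio: np.ndarray) -> np.ndarray:
    """Zero-pad (2, n) stereo audio, n <= SEGMENT_SAMPLES, to (1, 2, SEGMENT_SAMPLES)."""
    if audio.ndim != 2 or audio.shape[0] != 2:
        raise ValueError(f"expected shape (2, n), got {audio.shape}")
    n = audio.shape[1]
    if n > SEGMENT_SAMPLES:
        raise ValueError("clip is longer than one segment; chunk it first")
    padded = np.zeros((2, SEGMENT_SAMPLES), dtype=np.float32)
    padded[:, :n] = audio
    return padded[np.newaxis]

def unpack_stems(output: np.ndarray, n_samples: int) -> dict:
    """Split a (1, 4, 2, SEGMENT_SAMPLES) output into named (2, n_samples) arrays."""
    return {name: output[0, i, :, :n_samples] for i, name in enumerate(STEM_NAMES)}

clip = np.zeros((2, 44100), dtype=np.float32)  # one second of silence
mix = prepare_segment(clip)
# Stand-in for the array a session would return for the "stems" output.
fake_output = np.zeros((1, 4, 2, SEGMENT_SAMPLES), dtype=np.float32)
stems = unpack_stems(fake_output, clip.shape[1])
```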
For longer audio, chunk with overlap-add; see infer.py::separate for
a working 60-line implementation.
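As a rough picture of what overlap-add chunking involves, here is a minimal numpy sketch. It is not the code in infer.py: the 25% overlap, the triangular window, and the `run_segment` callable (a hypothetical wrapper that maps one padded segment to its stems) are all assumptions made for illustration.

```python
import numpy as np

def overlap_add(audio, run_segment, segment=343980, overlap=0.25, n_stems=4):
    """Separate (channels, n) audio of any length with weighted overlapping chunks.

    run_segment is a hypothetical callable mapping a (channels, segment) array
    to an (n_stems, channels, segment) array, e.g. a wrapper around a session.
    """
    channels, length = audio.shape
    hop = int(segment * (1.0 - overlap))
    # Triangular window: chunk centres count most, chunk edges least.
    window = np.concatenate([
        np.arange(1, segment // 2 + 1),
        np.arange(segment - segment // 2, 0, -1),
    ]).astype(np.float32)
    out = np.zeros((n_stems, channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, hop):
        chunk = audio[:, start:start + segment]
        valid = chunk.shape[1]
        padded = np.zeros((channels, segment), dtype=np.float32)
        padded[:, :valid] = chunk  # zero-pad the final, shorter chunk
        pred = run_segment(padded)
        out[:, :, start:start + valid] += pred[:, :, :valid] * window[:valid]
        weight_sum[start:start + valid] += window[:valid]
    # Normalise by the accumulated weights so overlaps average out.
    return out / weight_sum

# Sanity check with an identity "model" that copies the mix into every stem.
identity = lambda seg: np.stack([seg] * 4)
demo = np.random.default_rng(0).uniform(-1, 1, (2, 3500)).astype(np.float32)
result = overlap_add(demo, identity, segment=1000)
```

With the identity stand-in, every stem should reproduce the input, which is a convenient way to test the chunking logic before plugging in a real session.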
Tooling: demucs-onnx Python package
This model can be run (and re-exported from PyTorch) via the open-source
demucs-onnx Python package
on PyPI. It auto-downloads from this repo on first use, so you don't
have to clone or wrangle file paths.
pip install demucs-onnx
# Single-file 4-stem flavor (this repo):
demucs-onnx separate song.mp3 stems/ --model htdemucs
# Python API:
python -c "from demucs_onnx import separate; \
print(separate('song.mp3', model='htdemucs').keys())"
To re-export your own fine-tune:
pip install 'demucs-onnx[export]'
demucs-onnx export htdemucs out/htdemucs.onnx
How it was built
The export pipeline lives in the open-source
demucs-onnx package at
demucs_onnx/export/.
It applies four patches to make torch.onnx.export work on htdemucs:
- Complex-typed torch.stft outputs → Conv1d with sin/cos kernels.
- model.segment fractions.Fraction → plain float.
- random.randrange in transformer pos-embedding → hardcoded shift=0.
- aten::_native_multi_head_attention (no ONNX symbolic) → drop-in nn.MultiheadAttention.forward built from Linear / bmm / softmax.
These are the four blockers every previous community attempt at "demucs onnx" stalled on. See the README of the demucs-onnx package for the full write-up with code references.
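To make the first patch more concrete, the sketch below shows the underlying idea in plain numpy: an STFT can be computed with real-valued cosine and sine kernels applied at a fixed stride, which is what a Conv1d does, so no complex dtype is needed. This is an illustration of the technique, not the package's code; the function name, FFT size, and hop are arbitrary, and it omits the centre padding that torch.stft applies by default.

```python
import numpy as np

def stft_real(signal, n_fft=64, hop=16):
    """STFT of a 1-D real signal using only real sin/cos kernels.

    Each kernel row acts like one Conv1d filter of width n_fft applied with
    stride hop. Returns (real, imag), each shaped (n_fft // 2 + 1, n_frames).
    """
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    angle = 2.0 * np.pi * k * n / n_fft
    cos_kernel = np.cos(angle) * window    # real part of the DFT basis
    sin_kernel = -np.sin(angle) * window   # imaginary part of the DFT basis
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] for i in range(n_frames)], axis=1
    )
    return cos_kernel @ frames, sin_kernel @ frames

x = np.random.default_rng(0).standard_normal(512)
real, imag = stft_real(x)
```

The two outputs match the real and imaginary parts of a windowed FFT of each frame, so downstream code can treat them as a two-channel real tensor.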
Related work
Sibling ONNX repos from the same export pipeline:
| Repo | Format | Stems | Use when |
|---|---|---|---|
| htdemucs-onnx (this) | Single file | 4 | Faster startup, fewer sessions, ~30% lower latency than the FT bag. |
| htdemucs-ft-onnx | Bag of 4 files | 4 | Best SDR, especially on vocals. The default in StemSplit production. |
| htdemucs-6s-onnx | Single file | 6 | Need guitar + piano stems on top of the standard 4. |
| htdemucs-ft-{drums,bass,other,vocals}-onnx | Single specialist | 1 | Fastest single-stem inference; 4× faster than the bag. |
Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.
Skip the infrastructure: use the StemSplit API
Don't want to bundle a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead: same model under the hood, hosted for you, with credits and a dashboard.
- stemsplit.io
- Developer docs
- API reference
Or use the no-code tools that ship the same model family:
- Vocal Remover
- Karaoke Maker
- Acapella Maker
- YouTube Stem Splitter
License & attribution
This repo is MIT-licensed, matching the original HT-Demucs.
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
- Original PyTorch model: facebookresearch/demucs
- ONNX export, parity verification, and packaging by StemSplit
- Search keywords: htdemucs onnx, demucs onnx single file, demucs ios, demucs android, music source separation onnx, stem separation mobile.