HT-Demucs FT: Full 4-Stem Bag, ONNX
The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub.
Four parity-verified ONNX models (drums, bass, other, vocals) plus a
~250-line numpy aggregator that runs the full 4-stem separation in pure
onnxruntime. No PyTorch required at inference. Runs on CPU /
CoreML / CUDA / DirectML.
This repo is the convenience drop: all 4 specialist sub-models of
htdemucs_ft in one place, with a working bag-inference script. If you
only need one stem in production, the individual stem-specialist repos
below are ~75% smaller and ~4× faster per song.
TL;DR
pip install onnxruntime numpy soundfile
python bag_infer.py your-song.mp3 ./out/
# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
That's it. The 4 .onnx files (316 MB each, ~1.26 GB total) live
alongside the script.
Quality
Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4
via museval. Identical to the official PyTorch htdemucs_ft: the
bag's per-stem output IS the corresponding specialist's output (the weight
matrix is one-hot per stem).
| Stem | SDR (dB) | Rank in our 2026 benchmark |
|---|---|---|
| vocals | 9.19 | #1 (highest open-source vocal SDR) |
| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |
Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.
ONNX vs PyTorch parity: verified to < 1e-3 max abs diff on every stem during export. See the Day 1 spike report for the full engineering writeup.
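A parity check of this kind can be sketched in a few lines of numpy. This is only an illustration of the "max abs diff < 1e-3" criterion, not the export pipeline's actual test: the helper names `max_abs_diff` and `passes_parity` are hypothetical, and the arrays are random stand-ins for real PyTorch and ONNX outputs.

```python
import numpy as np


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shaped arrays."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Compare in float64 so the comparison itself adds no rounding error.
    diff = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(diff)))


def passes_parity(reference: np.ndarray, candidate: np.ndarray, tol: float = 1e-3) -> bool:
    """True when the two outputs agree to within `tol` everywhere."""
    return max_abs_diff(reference, candidate) < tol


# Stand-in data: in a real check these would be the PyTorch and ONNX
# outputs for the same input segment.
rng = np.random.default_rng(0)
torch_out = rng.standard_normal((1, 4, 2, 1024)).astype(np.float32)
onnx_out = torch_out + np.float32(1e-5)

print("parity ok:", passes_parity(torch_out, onnx_out))
```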
Performance
Measured on an Apple M4 Pro unless a row is marked as extrapolated:
| Mode | Hardware | Per 3-min song | Notes |
|---|---|---|---|
| One specialist (htdemucs-ft-drums-onnx) | M4 Pro CPU | ~22 s | 4× faster, 75% smaller; use this if you only need one stem |
| Full bag (this repo) | M4 Pro CPU | ~88 s | RTF ~0.5. 4 sub-models × N chunks. |
| Full bag | M4 Pro CPU (8 threads) | ~60 s | With OMP_NUM_THREADS=8 and SessionOptions tuned |
| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
| PyTorch full bag | M4 Pro MPS | ~47 s | Faster only because MPS is GPU-accelerated; ONNX-CUDA beats it cleanly. |
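The RTF figure in the table can be reproduced from the per-song timings, assuming RTF means processing time divided by audio duration (the usual definition, and consistent with ~88 s for a 3-minute song giving ~0.5). The helper name `real_time_factor` is hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


SONG_SECONDS = 3 * 60  # the table's "3-min song"

# Per-song timings copied from the table above.
timings = {
    "one specialist, M4 Pro CPU": 22,
    "full bag, M4 Pro CPU": 88,
    "full bag, M4 Pro CPU (8 threads)": 60,
}

for label, seconds in timings.items():
    print(f"{label}: RTF ~{real_time_factor(seconds, SONG_SECONDS):.2f}")
```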
Tooling: demucs-onnx Python package
This bag is also packaged in the open-source
demucs-onnx Python package
on PyPI. It auto-downloads each specialist from the matching HF repo on
first use, so you don't even need to manually fetch the four .onnx
files.
pip install demucs-onnx
# Full 4-stem separation (auto-downloads ~1.26 GB on first run)
demucs-onnx separate song.mp3 stems/
# From Python
python -c "from demucs_onnx import separate; stems = separate('song.mp3')"
The same package is also the canonical tool for exporting htdemucs
to ONNX yourself: it bundles all four blocker fixes (complex STFT,
fractions.Fraction, random.randrange,
aten::_native_multi_head_attention) so vanilla torch.onnx.export
works on your own demucs checkpoints.
pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft out/ # writes 4 .onnx files
Common use cases
- Karaoke makers: summing out/drums.wav, out/bass.wav, and out/other.wav gives a clean karaoke track, and out/vocals.wav is the acapella, all in one pass.
- DAW stem export: drop the 4 .wav files into Ableton / Logic / Reaper as separate channels for remixing.
- DJ stems software: load all 4 stems as live-mixable tracks.
- AI music apps: feed each stem into downstream models (drum transcription, bassline-to-MIDI, vocal pitch correction).
- Acapella sampling: clean isolated vocals at the highest SDR available in open source.
- Mobile / on-device separation: replaces a 1+ GB PyTorch install with onnxruntime's 50 MB binary on iOS / Android.
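For the karaoke use case, one way to build an instrumental from the separated stems is to sum every stem except vocals. The sketch below assumes the `dict[str, numpy.ndarray (2, samples)]` layout described under Quick start; `karaoke_mix` is a hypothetical helper, not part of bag_infer.py, and the peak normalisation is just one simple way to avoid clipping.

```python
import numpy as np


def karaoke_mix(stems: dict) -> np.ndarray:
    """Sum the non-vocal stems into one instrumental track of shape (2, samples)."""
    instrumental = stems["drums"] + stems["bass"] + stems["other"]
    # Summed stems can exceed full scale; scale down only when needed.
    peak = float(np.max(np.abs(instrumental)))
    if peak > 1.0:
        instrumental = instrumental / peak
    return instrumental.astype(np.float32)


# Stand-in stems: constant signals instead of real separated audio.
demo_stems = {
    "drums": np.full((2, 8), 0.1, dtype=np.float32),
    "bass": np.full((2, 8), 0.2, dtype=np.float32),
    "other": np.full((2, 8), 0.3, dtype=np.float32),
    "vocals": np.full((2, 8), 0.4, dtype=np.float32),
}
instrumental = karaoke_mix(demo_stems)
```

The vocals stem is left untouched, so it can be written out separately as the acapella.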
Quick start
Python: as a library
import bag_infer
stems = bag_infer.separate_all("your-song.mp3")
# stems: dict[str, numpy.ndarray (2, samples)]
# stems["drums"], stems["bass"], stems["other"], stems["vocals"]
Python: with execution provider control
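import soundfile as sf
import bag_infer

# sf.read returns (samples, channels); always_2d keeps mono files 2-D.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
# Pass the audio as (channels, samples), hence the transpose.
stems = bag_infer.separate(
    audio.T, sr,
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
)
# Each stem is (2, samples); transpose back to (samples, channels) to write.
for name, stem in stems.items():
    sf.write(f"{name}.wav", stem.T, sr)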
CLI
python bag_infer.py your-song.mp3 ./out/
python bag_infer.py your-song.mp3 ./out/ --providers cuda
python bag_infer.py your-song.mp3 ./out/ --providers coreml
python bag_infer.py your-song.mp3 ./out/ --providers dml
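The short provider names on the command line have to be turned into onnxruntime execution provider strings at some point. How bag_infer.py does this is not shown here, so the following is only a sketch under that assumption: `PROVIDER_ALIASES` and `resolve_providers` are hypothetical names, while the provider strings themselves are the ones onnxruntime uses.

```python
# Short CLI names mapped to onnxruntime execution provider identifiers.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",
}


def resolve_providers(alias: str, available: list) -> list:
    """Return an ordered provider list: the requested one if present, then CPU."""
    try:
        requested = PROVIDER_ALIASES[alias.lower()]
    except KeyError:
        raise ValueError(f"unknown provider alias: {alias!r}") from None
    providers = [requested] if requested in available else []
    # Always keep CPU as the last-resort fallback.
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers


# In real use, `available` would come from onnxruntime.get_available_providers().
print(resolve_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Falling back to CPU rather than failing keeps the same command usable on machines without the requested accelerator.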
Web / mobile
Each specialist is a vanilla onnxruntime model; just load all 4 sessions
and reuse the aggregation logic in bag_infer.py::separate. See the
individual stem repos for platform-specific snippets:
drums · bass · other · vocals.
How aggregation works
The htdemucs_ft bag uses a one-hot weight matrix for combining the 4
sub-models: model 0's drums output is used directly as the bag's drums
stem, model 1's bass output is the bag's bass stem, and so on. No
weighted-sum aggregation needed.
That means:
- The bag's drums stem == the drums specialist's drums output (bit-exact in fp32)
- Same for bass, other, vocals
- So you can ship only the specialists you need and get identical per-stem quality to the full bag at 1/4 the size
bag_infer.py simply runs all 4 specialists and picks the relevant row
from each. ~30 lines of numpy.
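To see why a one-hot weight matrix reduces to "pick one row per model", the sketch below compares a general weighted bag average with a plain diagonal pick. It is an illustration, not the code in bag_infer.py: `aggregate` and `pick_diagonal` are hypothetical names, and `outputs[m]` stands for model m's `stems` output for one chunk with the batch dimension dropped.

```python
import numpy as np

STEMS = ["drums", "bass", "other", "vocals"]

# weights[m, s] = how much model m contributes to stem s.
# For this bag the matrix is one-hot: model i is trusted only for stem i.
BAG_WEIGHTS = np.eye(len(STEMS), dtype=np.float32)


def aggregate(outputs: np.ndarray, weights: np.ndarray = BAG_WEIGHTS) -> np.ndarray:
    """General weighted average over models.

    outputs: (model, stem, channel, sample) -> returns (stem, channel, sample).
    """
    totals = weights.sum(axis=0)  # total weight per stem
    mixed = np.einsum("ms,msct->sct", weights, outputs)
    return mixed / totals[:, None, None]


def pick_diagonal(outputs: np.ndarray) -> np.ndarray:
    """Shortcut for one-hot weights: take stem i from model i."""
    return np.stack([outputs[i, i] for i in range(len(STEMS))])


# Random stand-in for the four specialists' outputs on one chunk.
rng = np.random.default_rng(0)
outputs = rng.standard_normal((4, 4, 2, 16)).astype(np.float32)
bag = aggregate(outputs)
```

With the identity matrix as weights, both functions return the same array, which is why the specialists can be shipped individually without changing per-stem results.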
Input / output spec per sub-model
| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | mix | (1, 2, 343980) | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
| Output | stems | (1, 4, 2, 343980) | float32 | [drums, bass, other, vocals]. Use only the specialist's target row. |
For longer audio, the bag script handles overlap-add chunking.
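Overlap-add chunking can be sketched as follows. This is a minimal illustration assuming 25% overlap and a triangular cross-fade window; the actual parameters and code in bag_infer.py may differ. `overlap_add` and `run_segment` are hypothetical names, and `run_segment` stands in for one model call that maps a `(channels, segment)` array to an array of the same shape.

```python
import numpy as np

SEGMENT = 343980  # samples per model call, from the input spec above


def overlap_add(mix, run_segment, segment=SEGMENT, overlap=0.25):
    """Process `mix` (channels, samples) in overlapping fixed-size segments.

    Each segment's output is weighted by a triangular window, accumulated,
    and finally divided by the summed window weights.
    """
    channels, length = mix.shape
    stride = int(segment * (1 - overlap))
    half = segment // 2
    # Triangular window: ramps up to the middle, then back down; always > 0.
    window = np.concatenate(
        [np.arange(1, half + 1), np.arange(segment - half, 0, -1)]
    ).astype(np.float32)
    out = np.zeros((channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        chunk = mix[:, start:start + segment]
        valid = chunk.shape[1]
        # Zero-pad the last chunk so the model always sees a full segment.
        padded = np.pad(chunk, ((0, 0), (0, segment - valid)))
        result = run_segment(padded)[:, :valid]
        out[:, start:start + valid] += window[:valid] * result
        weight_sum[start:start + valid] += window[:valid]
    return out / weight_sum


# Demo with an identity "model" and a tiny segment size.
rng = np.random.default_rng(0)
demo_mix = rng.standard_normal((2, 350)).astype(np.float32)
demo_out = overlap_add(demo_mix, lambda seg: seg, segment=100)
```

With an identity function in place of the model, the output reproduces the input, which is a quick sanity check that the window weights are normalised correctly.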
Files in this repo
| File | Size | Purpose |
|---|---|---|
| htdemucs_ft_drums.onnx | 316 MB | Drums specialist (bag index 0) |
| htdemucs_ft_bass.onnx | 316 MB | Bass specialist (bag index 1) |
| htdemucs_ft_other.onnx | 316 MB | Other specialist (bag index 2) |
| htdemucs_ft_vocals.onnx | 316 MB | Vocals specialist (bag index 3) |
| bag_infer.py | 7 KB | Pure numpy aggregator. No torch. |
| requirements.txt | <1 KB | onnxruntime, numpy, soundfile. |
| README.md | | This file. |
Total: ~1.26 GB. If that's too big, use individual stem repos.
Related work
| Repo | Stem | Use when |
|---|---|---|
| htdemucs-ft-drums-onnx | drums | Only need drums (1/4 size, 1/4 latency) |
| htdemucs-ft-bass-onnx | bass | Only need bass |
| htdemucs-ft-other-onnx | other | Only need "other" / instrumental |
| htdemucs-ft-vocals-onnx | vocals | #1 open-source vocal SDR |
PyTorch versions for HF Inference Endpoints:
htdemucs-ft-pytorch
and its 4 sibling specialist repos.
Skip the infrastructure: use the StemSplit API
Don't want to ship 1.26 GB of .onnx files in your app, manage a GPU
pool, or write overlap-add chunking? Use the
StemSplit API instead: same models
under the hood, hosted for you, with credits and a dashboard.
- stemsplit.io
- Developer docs
- API reference
Or use the no-code tools that ship this same model family:
- Vocal Remover
- Karaoke Maker
- Acapella Maker
- YouTube Stem Splitter
License & attribution
MIT-licensed, matching the original HT-Demucs.
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
- Original PyTorch model: facebookresearch/demucs
- ONNX export, parity verification, and packaging by StemSplit
- Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation