HT-Demucs FT: Full 4-Stem Bag, ONNX

The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub. Four parity-verified ONNX models (drums, bass, other, vocals) plus a ~250-line numpy aggregator that runs the full 4-stem separation in pure onnxruntime. No PyTorch required at inference. Runs on CPU / CoreML / CUDA / DirectML.

This repo is the convenience drop: all 4 specialist sub-models of htdemucs_ft in one place, with a working bag-inference script. If you only need one stem in production, the individual stem-specialist repos below are ~75% smaller and ~4× faster per song.


TL;DR

pip install onnxruntime numpy soundfile
python bag_infer.py your-song.mp3 ./out/
# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav

That's it. The 4 .onnx files (316 MB each, ~1.26 GB total) live alongside the script.


Quality

Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4 via museval. Identical to the official PyTorch htdemucs_ft: the bag's per-stem output IS the corresponding specialist's output (the weight matrix is one-hot per stem).

| Stem | SDR (dB) | Rank in our 2026 benchmark |
|---|---|---|
| vocals | 9.19 | #1 (highest open-source vocal SDR) |
| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |

Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.

ONNX vs PyTorch parity: verified to < 1e-3 max abs diff on every stem during export. See the Day 1 spike report for the full engineering writeup.
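A parity check of this kind can be sketched in a few lines of numpy. This is only an illustration of the "max abs diff" metric named above, not the export pipeline's actual test. The helper names `max_abs_diff` and `check_parity` are hypothetical, and the two arrays stand in for the PyTorch and ONNX outputs of the same input segment.

```python
import numpy as np


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two same-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    return float(np.max(np.abs(reference - candidate)))


def check_parity(reference, candidate, tol=1e-3):
    """True when the two outputs agree to within `tol` (hypothetical helper)."""
    return max_abs_diff(reference, candidate) < tol


# Synthetic stand-ins for a (1, 4, 2, samples) model output.
pytorch_out = np.zeros((1, 4, 2, 8), dtype=np.float32)
onnx_out = pytorch_out + 5e-4  # small numerical drift, inside tolerance
parity_ok = check_parity(pytorch_out, onnx_out)
```

In a real check the two arrays would come from running the PyTorch model and the ONNX session on the same input tensor.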


Performance

Real measurements on an Apple M4 Pro:

| Mode | Hardware | Per 3-min song | Notes |
|---|---|---|---|
| One specialist (htdemucs-ft-drums-onnx) | M4 Pro CPU | ~22 s | 4× faster, 75% smaller; use this if you only need one stem |
| Full bag (this repo) | M4 Pro CPU | ~88 s | RTF ~0.5. 4 sub-models × N chunks. |
| Full bag | M4 Pro CPU (8 threads) | ~60 s | With OMP_NUM_THREADS=8 and SessionOptions tuned |
| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
| PyTorch full bag | M4 Pro MPS | ~47 s | Faster only because MPS is GPU-accelerated; ONNX-CUDA beats it cleanly. |
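The RTF (real-time factor) quoted in the table is processing time divided by audio duration, so values below 1 mean faster than real time. A minimal sketch of that arithmetic, using the table's CPU timings and a 180 s song (the function name is illustrative):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


SONG_SECONDS = 180.0  # the 3-minute song used in the table

rtf_full_bag = real_time_factor(88.0, SONG_SECONDS)    # full bag, M4 Pro CPU
rtf_specialist = real_time_factor(22.0, SONG_SECONDS)  # one specialist, M4 Pro CPU
```

The full bag comes out close to 0.49, which matches the rounded "RTF ~0.5" in the table.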

Tooling: demucs-onnx Python package

This bag is also packaged in the open-source demucs-onnx Python package on PyPI. It auto-downloads each specialist from the matching HF repo on first use, so you don't even need to manually fetch the four .onnx files.

pip install demucs-onnx

# Full 4-stem separation (auto-downloads ~1.26 GB on first run)
demucs-onnx separate song.mp3 stems/

# From Python
python -c "from demucs_onnx import separate; stems = separate('song.mp3')"

The same package is also the canonical tool for exporting htdemucs to ONNX yourself. It bundles all four blocker fixes (complex STFT, fractions.Fraction, random.randrange, aten::_native_multi_head_attention) so vanilla torch.onnx.export works on your own demucs checkpoints.

pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft out/   # writes 4 .onnx files

Common use cases

  • Karaoke makers: summing out/drums.wav, out/bass.wav, and out/other.wav gives a karaoke (instrumental) track, and out/vocals.wav is the acapella, all from one pass.
  • DAW stem export: drop the 4 .wav files into Ableton / Logic / Reaper as separate channels for remixing.
  • DJ stems software: load all 4 stems as live-mixable tracks.
  • AI music apps: feed each stem into downstream models (drum transcription, bassline-to-MIDI, vocal pitch correction).
  • Acapella sampling: clean isolated vocals at the highest SDR available in open source.
  • Mobile / on-device separation: replaces a 1+ GB PyTorch install with onnxruntime's 50 MB binary on iOS / Android.
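For the karaoke use case, an instrumental track is the sum of the three non-vocal stems. The sketch below assumes stems shaped like the `separate_all` result described in this README (a dict of `(2, samples)` float arrays). `karaoke_mix` is a hypothetical helper, not part of bag_infer.py, and the peak normalization is an added safety step rather than something the script is known to do.

```python
import numpy as np


def karaoke_mix(stems):
    """Sum the non-vocal stems into one instrumental track of shape (2, samples)."""
    instrumental = stems["drums"] + stems["bass"] + stems["other"]
    peak = float(np.max(np.abs(instrumental)))
    if peak > 1.0:
        # Scale down only when the sum would clip on export.
        instrumental = instrumental / peak
    return instrumental


# Synthetic stems standing in for real separation output.
rng = np.random.default_rng(0)
demo_stems = {
    name: (0.1 * rng.standard_normal((2, 1000))).astype(np.float32)
    for name in ("drums", "bass", "other", "vocals")
}
instrumental = karaoke_mix(demo_stems)
acapella = demo_stems["vocals"]
```

Writing `instrumental.T` and `acapella.T` with soundfile then gives the two karaoke files.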

Quick start

Python: as a library

import bag_infer

stems = bag_infer.separate_all("your-song.mp3")
# stems: dict[str, numpy.ndarray (2, samples)]
#   stems["drums"], stems["bass"], stems["other"], stems["vocals"]

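Python: with execution provider control

import soundfile as sf
import bag_infer

# Reading MP3 through soundfile needs libsndfile >= 1.1.0.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = bag_infer.separate(
    audio.T, sr,  # soundfile returns (samples, channels); transpose to (channels, samples)
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
)
# Use a separate loop variable so the input `audio` is not overwritten.
for name, stem_audio in stems.items():
    sf.write(f"{name}.wav", stem_audio.T, sr)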

CLI

python bag_infer.py your-song.mp3 ./out/
python bag_infer.py your-song.mp3 ./out/ --providers cuda
python bag_infer.py your-song.mp3 ./out/ --providers coreml
python bag_infer.py your-song.mp3 ./out/ --providers dml
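The short names accepted by `--providers` correspond to onnxruntime execution providers. How bag_infer.py maps them internally is not shown here, so the following is only a sketch of one plausible mapping with a CPU fallback. `PROVIDER_ALIASES` and `resolve_providers` are hypothetical names. The provider strings themselves are the standard onnxruntime identifiers.

```python
# Hypothetical alias table; the right-hand strings are onnxruntime provider names.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",
}


def resolve_providers(alias, available):
    """Return a provider list for InferenceSession, always ending with CPU.

    `available` is the list of providers the installed onnxruntime build supports.
    """
    name = PROVIDER_ALIASES[alias.lower()]  # KeyError for unknown aliases
    providers = []
    if name in available:
        providers.append(name)
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers


cuda_choice = resolve_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"])
fallback_choice = resolve_providers("coreml", ["CPUExecutionProvider"])
```

With onnxruntime installed, `available` would come from `onnxruntime.get_available_providers()`, and the result would be passed as `providers=` to `onnxruntime.InferenceSession`.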

Web / mobile

Each specialist is a vanilla onnxruntime model; just load all 4 sessions and reuse the aggregation logic in bag_infer.py::separate. See the individual stem repos for platform-specific snippets: drums · bass · other · vocals.


How aggregation works

The htdemucs_ft bag uses a one-hot weight matrix for combining the 4 sub-models: model 0's drums output is used directly as the bag's drums stem, model 1's bass output is the bag's bass stem, and so on. No weighted-sum aggregation needed.

That means:

  • The bag's drums stem == the drums specialist's drums output (bit-exact in fp32)
  • Same for bass, other, vocals
  • So you can ship only the specialists you need and get identical per-stem quality to the full bag at 1/4 the size

bag_infer.py simply runs all 4 specialists and picks the relevant row from each: ~30 lines of numpy for the aggregation step.
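The one-hot selection can be illustrated with synthetic arrays. This is a sketch, not the code in bag_infer.py. It assumes each specialist returns a `(4, channels, samples)` array in `[drums, bass, other, vocals]` order. The function names are illustrative, and the general weighted form is included only to show that an identity weight matrix reduces to plain row-picking.

```python
import numpy as np

STEMS = ["drums", "bass", "other", "vocals"]


def aggregate_one_hot(outputs):
    """outputs[m] is specialist m's (4, channels, samples) result; keep row m of model m."""
    return {name: outputs[i][i] for i, name in enumerate(STEMS)}


def aggregate_weighted(outputs, weights):
    """General bag: per-stem weighted average over models, weights[model, stem]."""
    stacked = np.stack(outputs)  # (models, stems, channels, samples)
    mixed = np.einsum("ms,msct->sct", weights, stacked)
    mixed = mixed / weights.sum(axis=0)[:, None, None]
    return {name: mixed[i] for i, name in enumerate(STEMS)}


rng = np.random.default_rng(0)
fake_outputs = [rng.standard_normal((4, 2, 16)) for _ in range(4)]
picked = aggregate_one_hot(fake_outputs)
averaged = aggregate_weighted(fake_outputs, np.eye(4))
```

With the identity matrix as weights, `averaged` and `picked` hold the same values, which is why the script can skip the weighted sum.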


Input / output spec per sub-model

| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | mix | (1, 2, 343980) | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
| Output | stems | (1, 4, 2, 343980) | float32 | [drums, bass, other, vocals]. Use only the specialist's target row. |

For longer audio, the bag script handles overlap-add chunking.
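Overlap-add chunking can be sketched as follows. This is not the script's actual implementation. The 25% overlap and the triangular weighting window are assumptions borrowed from common Demucs-style chunking, and `run_segment` is a hypothetical callable standing in for one model call on a fixed-length segment.

```python
import numpy as np

SEGMENT = 343980  # samples per model call, from the input spec above


def overlap_add(mix, run_segment, segment=SEGMENT, overlap=0.25):
    """Process (channels, samples) audio in fixed-size chunks and blend the overlaps.

    run_segment maps a (channels, segment) array to a same-shaped array.
    """
    channels, length = mix.shape
    stride = int((1.0 - overlap) * segment)

    # Triangular window: chunk centers count more than chunk edges.
    half = segment // 2
    weight = np.concatenate(
        [np.arange(1, half + 1), np.arange(segment - half, 0, -1)]
    ).astype(np.float32)
    weight /= weight.max()

    out = np.zeros((channels, length), dtype=np.float32)
    total = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        chunk = mix[:, start:start + segment]
        valid = chunk.shape[1]
        padded = np.zeros((channels, segment), dtype=np.float32)
        padded[:, :valid] = chunk  # zero-pad the last, shorter chunk
        result = run_segment(padded)[:, :valid]
        out[:, start:start + valid] += weight[:valid] * result
        total[start:start + valid] += weight[:valid]
    return out / total  # normalize by the accumulated window weight


# Identity "model" on a short signal: the output should reproduce the input.
rng = np.random.default_rng(0)
demo_mix = rng.standard_normal((2, 457)).astype(np.float32)
demo_out = overlap_add(demo_mix, lambda chunk: chunk, segment=100, overlap=0.25)
```

Dividing by the accumulated weights keeps the gain at unity wherever chunks overlap, so an identity model returns the input unchanged.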


Files in this repo

| File | Size | Purpose |
|---|---|---|
| htdemucs_ft_drums.onnx | 316 MB | Drums specialist (bag index 0) |
| htdemucs_ft_bass.onnx | 316 MB | Bass specialist (bag index 1) |
| htdemucs_ft_other.onnx | 316 MB | Other specialist (bag index 2) |
| htdemucs_ft_vocals.onnx | 316 MB | Vocals specialist (bag index 3) |
| bag_infer.py | 7 KB | Pure numpy aggregator. No torch. |
| requirements.txt | <1 KB | onnxruntime, numpy, soundfile. |
| README.md | | This file |

Total: ~1.26 GB. If that's too big, use individual stem repos.


Related work

| Repo | Stem | Use when |
|---|---|---|
| htdemucs-ft-drums-onnx | drums | Only need drums (1/4 size, 1/4 latency) |
| htdemucs-ft-bass-onnx | bass | Only need bass |
| htdemucs-ft-other-onnx | other | Only need "other" / instrumental |
| htdemucs-ft-vocals-onnx | vocals | #1 open-source vocal SDR |

PyTorch versions for HF Inference Endpoints: htdemucs-ft-pytorch and its 4 sibling specialist repos.


Skip the infrastructure: use the StemSplit API

Don't want to ship 1.26 GB of .onnx files in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead: same models under the hood, hosted for you, with credits and a dashboard.

Or use the no-code tools that ship this same model family:


License & attribution

MIT-licensed, matching the original HT-Demucs.

@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
  • Original PyTorch model: facebookresearch/demucs
  • ONNX export, parity verification, and packaging by StemSplit
  • Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation