HT-Demucs FT: Instrumental / Other Specialist, ONNX
Melodic / instrumental specialist from HT-Demucs FT: everything that isn't vocals, drums, or bass. ONNX runtime, no PyTorch needed.
This repo packages sub-model 2 of the
htdemucs_ft 4-bag ensemble
as a single 316 MB .onnx file plus a ~150-line numpy reference inference
script. Verified against the original PyTorch model: maximum absolute
difference below 1e-3.
Want all 4 stems in one drop-in package? Use the full bag repo:
StemSplitio/htdemucs-ft-onnx.
TL;DR
pip install onnxruntime numpy soundfile
python infer.py your-song.mp3 ./out/
# writes ./out/other.wav at 44.1 kHz stereo
That's it. No PyTorch, no CUDA setup, no GPU server.
Quality
| Metric (MUSDB18-HQ test, 50 songs) | Value | Source |
|---|---|---|
| Median other SDR | 6.34 dB | StemSplitio/stem-separation-benchmark-2026 |
| Rank among open-source separators on other | 2nd (mdx_extra_q leads at 7.67) | same |
| ONNX vs PyTorch max abs diff | < 1e-3 | verified during export (see Day 1 spike report) |
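The parity row above is a maximum absolute difference between the outputs of the two runtimes. A minimal sketch of such a check follows, assuming both outputs are already available as numpy arrays of the same shape. The function name `max_abs_diff` and the synthetic arrays are illustrative only and are not part of this repo or of the export report.

```python
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference between two equally shaped arrays."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Compare in float64 so the comparison itself adds no rounding error.
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))

# Synthetic stand-ins for the PyTorch and ONNX outputs (hypothetical data).
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 4, 2, 1024)).astype(np.float32)
candidate = reference + np.float32(1e-4)  # simulate small numerical drift

diff = max_abs_diff(reference, candidate)
parity_ok = diff < 1e-3  # same threshold as the table
```

With real data, `reference` would be the PyTorch output and `candidate` the ONNX Runtime output for the same input segment.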
Performance
| Runtime | Hardware | Per 7.8-s segment | Per 3-min song |
|---|---|---|---|
| onnxruntime CPU EP | Apple M4 Pro | ~1.6 s | ~22 s |
| PyTorch CPU | Apple M4 Pro | ~2.1 s | ~29 s |
| onnxruntime CUDA EP | NVIDIA L4 | ~0.4 s | ~5 s (extrapolated) |
| onnxruntime DirectML EP | RTX 4090 | ~0.2 s | ~2 s (extrapolated) |
Real-time factor on M4 Pro CPU: 0.20. Roughly 1.31× faster than PyTorch CPU on the same hardware.
Tooling: demucs-onnx Python package
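The two summary figures follow from the per-segment timings in the table. As a small worked example (the helper name `real_time_factor` is illustrative; the segment length comes from the input spec further down):

```python
SEGMENT_SECONDS = 343980 / 44100  # 7.8 s of audio per model call

def real_time_factor(compute_seconds: float,
                     audio_seconds: float = SEGMENT_SECONDS) -> float:
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    return compute_seconds / audio_seconds

rtf_onnx = real_time_factor(1.6)   # onnxruntime CPU EP on M4 Pro, about 0.205
rtf_torch = real_time_factor(2.1)  # PyTorch CPU on M4 Pro, about 0.269
speedup = 2.1 / 1.6                # 1.3125
```

These numbers describe a single segment only. Whole-song timings also depend on how much the chunks overlap.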
This model can also be run (and re-exported) via the open-source
demucs-onnx Python package
on PyPI. It auto-downloads from this repo on first use.
pip install demucs-onnx
# Single specialist (this repo)
demucs-onnx separate song.mp3 stems/ --stem other
# Or via the Python API
python -c "from demucs_onnx import separate_stem; \
audio = separate_stem('song.mp3', 'other')"
The same package is also the canonical tool for exporting htdemucs
to ONNX yourself. It bundles all four blocker fixes (complex STFT,
fractions.Fraction, random.randrange,
aten::_native_multi_head_attention) so vanilla torch.onnx.export
works on your own checkpoints.
pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft other.onnx --stem other
Common use cases
- Karaoke / instrumental tracks: extract the music-minus-vocals layer (pair with the vocals ONNX for clean round-tripping)
- Sample-flipping: isolate guitar/keys/synth lines for chopping and remixing
- Cover-song production: strip vocals and rebalance the instrumental bed
- Music-bed for video: remove vocals from licensed tracks for under-spoken-word use (check sync rights first)
Quick start
Python: minimal
import infer
other = infer.separate_other("your-song.mp3")
# other: numpy array (2, samples) at 44.1 kHz
Python: full control
import soundfile as sf
import infer
# Optional execution providers β CPU is the default and most portable.
# Swap to "coreml" on macOS, "cuda" on NVIDIA, "dml" on Windows DX12.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = infer.separate(audio.T, sr, providers=["CPUExecutionProvider"])
sf.write("other.wav", stems[infer.SOURCES.index("other")].T, sr)
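The model expects 44.1 kHz stereo float32 input, so a pre-flight check before calling the separator can catch mono or wrongly sampled files early. The sketch below is an assumption about how one might do that. `prepare_mix` is a hypothetical helper, not part of infer.py, and infer.py may already handle some of these cases itself.

```python
import numpy as np

EXPECTED_SR = 44100  # sample rate the model was exported for

def prepare_mix(audio: np.ndarray, sr: int) -> np.ndarray:
    """Turn soundfile-style (samples, channels) audio into a (2, samples) float32 array."""
    if sr != EXPECTED_SR:
        raise ValueError(f"expected {EXPECTED_SR} Hz audio, got {sr} Hz; resample first")
    if audio.ndim == 1:
        audio = audio[:, None]               # (samples,) -> (samples, 1)
    if audio.shape[1] == 1:
        audio = np.repeat(audio, 2, axis=1)  # mono -> dual mono
    elif audio.shape[1] != 2:
        raise ValueError(f"expected mono or stereo, got {audio.shape[1]} channels")
    return np.ascontiguousarray(audio.T, dtype=np.float32)
```

The result can be passed where the example above passes `audio.T`.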
CLI
python infer.py your-song.mp3 ./out/
python infer.py your-song.mp3 ./out/ --providers cuda # NVIDIA
python infer.py your-song.mp3 ./out/ --providers coreml # macOS
python infer.py your-song.mp3 ./out/ --providers dml # Windows
Mobile (iOS / Swift)
import onnxruntime_objc
let env = try ORTEnv(loggingLevel: .warning)
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
let session = try ORTSession(env: env,
modelPath: Bundle.main.path(forResource: "htdemucs_ft_other", ofType: "onnx")!,
sessionOptions: opts)
// audio: 1 × 2 × 343980 Float32 buffer, then session.run(...).
Mobile (Android / Kotlin)
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
val env = OrtEnvironment.getEnvironment()
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelPath, opts)
Web (onnxruntime-web)
import * as ort from "onnxruntime-web";
const session = await ort.InferenceSession.create("htdemucs_ft_other.onnx", {
executionProviders: ["wasm"],
graphOptimizationLevel: "all",
});
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });
// out.stems.data is a flat Float32Array with logical shape (1, 4, 2, 343980); use row 2 for other.
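Because the output arrives as one flat buffer, picking out "row 2" means computing offsets by hand. The arithmetic is shown here in Python for consistency with the rest of this README, and the same offsets apply to the Float32Array. It assumes the usual row-major tensor layout; `stem_channel_slice` is an illustrative helper name.

```python
import numpy as np

SHAPE = (1, 4, 2, 343980)  # (batch, stem, channel, sample)
OTHER_INDEX = 2            # position of "other" in [drums, bass, other, vocals]

def stem_channel_slice(stem: int, channel: int, shape=SHAPE) -> slice:
    """Slice of a flat row-major buffer covering one channel of one stem (batch 0)."""
    _, _, n_channels, n_samples = shape
    start = (stem * n_channels + channel) * n_samples
    return slice(start, start + n_samples)

left = stem_channel_slice(OTHER_INDEX, 0)   # starts at element 1375920
right = stem_channel_slice(OTHER_INDEX, 1)  # starts at element 1719900
```

In JavaScript the equivalent would be `data.subarray(left.start, left.stop)` on the flat array.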
Input / output spec
| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | mix | (1, 2, 343980) | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. Values in [-1, 1]. |
| Output | stems | (1, 4, 2, 343980) | float32 | [drums, bass, other, vocals] order. Use only row 2 (other); the other 3 rows are weakly predicted by-products of the other specialist. |
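To make the spec concrete, here is a minimal sketch of a single-segment call that uses the tensor names and shapes from the table. `separate_segment` is a hypothetical helper, and zero-padding short chunks on the right is an assumption made for simplicity, so infer.py may pad differently. The session is passed in so that the function only relies on the standard `InferenceSession.run(output_names, input_feed)` call.

```python
import numpy as np

SEGMENT = 343980   # samples per model call (7.8 s at 44.1 kHz)
OTHER_INDEX = 2    # position of "other" in [drums, bass, other, vocals]

def separate_segment(session, chunk: np.ndarray) -> np.ndarray:
    """Run one (2, n) chunk with n <= SEGMENT and return the (2, n) "other" stem."""
    channels, n = chunk.shape
    if channels != 2 or n > SEGMENT:
        raise ValueError("expected a stereo chunk of at most SEGMENT samples")
    mix = np.zeros((1, 2, SEGMENT), dtype=np.float32)
    mix[0, :, :n] = chunk                      # zero-pad short chunks on the right
    (stems,) = session.run(["stems"], {"mix": mix})
    return stems[0, OTHER_INDEX, :, :n]        # drop batch dim, trim the padding

# Typical session creation (requires the model file from this repo):
#   import onnxruntime as ort
#   session = ort.InferenceSession("htdemucs_ft_other.onnx",
#                                  providers=["CPUExecutionProvider"])
```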
For longer audio, chunk with overlap-add β see infer.py::separate for a
working ~60-line implementation.
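For orientation, the general overlap-add idea can be sketched in a few lines. This is not a copy of infer.py::separate. The triangular window and the 25% overlap are assumptions modelled on common practice for Demucs-style chunking, and `run_segment` is a hypothetical callable that maps a (channels, n) chunk to a (channels, n) stem.

```python
import numpy as np

def overlap_add(mix, run_segment, segment=343980, overlap=0.25):
    """Apply run_segment to overlapping windows of a (channels, samples) array and blend."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    channels, length = mix.shape
    stride = int((1.0 - overlap) * segment)

    # Triangular window: largest mid-segment, smallest at the edges, never zero.
    ramp_up = np.arange(1, segment // 2 + 1, dtype=np.float32)
    ramp_down = np.arange(segment - segment // 2, 0, -1, dtype=np.float32)
    weight = np.concatenate([ramp_up, ramp_down])
    weight /= weight.max()

    out = np.zeros((channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        chunk = mix[:, start:start + segment]
        n = chunk.shape[1]                     # the last chunk may be shorter
        out[:, start:start + n] += weight[:n] * run_segment(chunk)
        weight_sum[start:start + n] += weight[:n]
    return out / weight_sum                    # normalise by accumulated weight
```

Dividing by the accumulated weight means that an identity `run_segment` reproduces the input, which is a quick sanity check for any chunking code. Segment edges, where the model has the least context, contribute the least to the blend.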
Related repos
Sibling stem-specialist ONNX repos from the same export:
| Repo | Stem | Use when |
|---|---|---|
| htdemucs-ft-drums-onnx | drums | Drum extraction, beat transcription |
| htdemucs-ft-bass-onnx | bass | Bassline transcription, mix rebalancing |
| htdemucs-ft-other-onnx | other | Karaoke instrumentals, sample-flipping |
| htdemucs-ft-vocals-onnx | vocals | #1 open-source vocal SDR: karaoke, acapella, vocal removal |
| htdemucs-ft-onnx | all 4 | Full 4-stem separation in one repo |
PyTorch versions for HF Inference Endpoints:
htdemucs-ft-pytorch,
htdemucs-ft-other-pytorch.
Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.
Skip the infrastructure: use the StemSplit API
Don't want to ship a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead: same model under the hood, hosted for you, with credits and a dashboard.
- stemsplit.io
- Developer docs
- API reference
- Guides & recipes
Or use the no-code tools that ship the same model family:
- Karaoke Maker
- Vocal Remover
- YouTube Stem Splitter
Files in this repo
| File | Size | Purpose |
|---|---|---|
| htdemucs_ft_other.onnx | 316 MB | The exported model. Opset 17. Passes onnx.checker. |
| infer.py | ~6 KB | Pure numpy + onnxruntime reference. No torch. |
| requirements.txt | <1 KB | onnxruntime, numpy, soundfile. |
| README.md | | this file |
License & attribution
This repo is MIT-licensed, matching the original HT-Demucs.
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
- Original PyTorch model: facebookresearch/demucs
- ONNX export, parity verification, and packaging by StemSplit
- Search keywords: instrumental extractor onnx, karaoke maker, music minus vocals, htdemucs other onnx