HT-Demucs FT: Instrumental / Other Specialist (PyTorch)
Melodic / instrumental specialist from HT-Demucs FT: everything that isn't vocals, drums, or bass.
This is sub-model 2 (zero-based, bag.models[2]) of the four-model htdemucs_ft bag by
Défossez et al. (Meta AI), extracted as a standalone
~160 MB model. It produces the other stem with the same quality as
the full ensemble (median SDR 6.34 dB on MUSDB18-HQ, 2nd of all
models in our 2026 benchmark, behind mdx_extra_q at 7.67 dB) at roughly 1/4 the compute cost.
Want all 4 stems in one request? Use the full ensemble: StemSplitio/htdemucs-ft-pytorch
Want a hosted REST API with credits and a dashboard? Use the StemSplit API.
Why this model
| Property | This model | Full htdemucs_ft bag |
|---|---|---|
| Disk size | ~160 MB | ~640 MB |
| Per-3-min-song latency (M4 Pro MPS) | ~22 s (RTF 0.12) | ~47 s (RTF 0.26) |
| Instrumental / Other SDR on MUSDB18-HQ | 6.34 dB | 6.34 dB (identical: the bag's other output is this sub-model's output) |
| Other stems returned | None (focused) | All 4 |
If you only need the other stem in production, this is strictly faster and smaller than the full ensemble with identical other quality: ~2.6× faster wall time in our smoke tests on M4 Pro MPS.
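The RTF (real-time factor) values in the table are processing time divided by audio duration, so lower is better and anything below 1.0 is faster than real time. A minimal sketch of that arithmetic, assuming the "3-min song" is exactly 180 s; the helper name `real_time_factor` is ours and purely illustrative:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


SONG_SECONDS = 180.0  # assumption: "3-min song" taken as exactly 180 s

specialist_rtf = real_time_factor(22.0, SONG_SECONDS)  # table: ~22 s
full_bag_rtf = real_time_factor(47.0, SONG_SECONDS)    # table: ~47 s

print(f"specialist RTF: {specialist_rtf:.2f}")
print(f"full bag RTF:   {full_bag_rtf:.2f}")
```

The same helper can be pointed at your own timings to compare hardware on equal footing, since RTF does not depend on song length.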
Common use cases
- Karaoke / instrumental tracks: extract the music-minus-vocals layer for karaoke mixes (use it with our htdemucs-ft-pytorch vocals model to round-trip)
- Sample-flipping: isolate guitar/keys/synth lines for chopping and remixing
- Cover-song production: remove vocals and rebalance the instrumental bed
- Music bed for video: strip vocals from licensed tracks for use under spoken word (check your sync rights first)
Quick start (Python)
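import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient
# Base64-encode the input file for the JSON payload
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()
client = InferenceClient(model="StemSplitio/htdemucs-ft-other-pytorch")
# client.post returns raw bytes, so parse the JSON body before indexing it.
# Note: post() is deprecated in newer huggingface_hub releases and may be unavailable there.
result = json.loads(client.post(json={"inputs": audio_b64}))
# The "other" field holds the base64-encoded audio of the separated stem
wav, sr = sf.read(io.BytesIO(base64.b64decode(result["other"])))
sf.write("out_other.wav", wav, sr)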
Or run locally without Hugging Face at all:
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model
bag = get_model("htdemucs_ft")
model = bag.models[2].eval()  # the other specialist
# soundfile returns (frames, channels); Demucs expects (channels, frames)
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
# Resample and remix to the model's rate and channel count, then add a batch dimension
wav = convert_audio(wav, sr, bag.samplerate, bag.audio_channels).unsqueeze(0)
with torch.no_grad():
    stems = apply_model(model, wav, device="mps" if torch.backends.mps.is_available() else "cpu")[0]
# bag.sources == ["drums", "bass", "other", "vocals"]; pick the other row
sf.write("out_other.wav", stems[bag.sources.index("other")].T.numpy(), bag.samplerate)
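Picking `bag.models[2]` and then reading only the other row works because each sub-model in the bag is responsible for exactly one source. The toy sketch below illustrates why the bag's other output then equals the specialist's other output. It assumes the bag combines sub-models with one-hot (identity) per-source weights; the `WEIGHTS` matrix, the `combine` helper, and the scalar "estimates" are illustrative stand-ins, not values read from the real model.

```python
SOURCES = ["drums", "bass", "other", "vocals"]

# Assumption: one row per sub-model, one column per source. Identity weights
# mean sub-model i is the only contributor to source i.
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def combine(estimates, weights):
    """Weighted average per source; estimates[i][k] is sub-model i's value for source k."""
    combined = []
    for k in range(len(weights[0])):
        total = sum(w[k] for w in weights)
        weighted = sum(w[k] * est[k] for w, est in zip(weights, estimates))
        combined.append(weighted / total)
    return combined


# Made-up scalar "estimates" standing in for audio tensors.
estimates = [
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.4],
    [0.3, 0.2, 0.7, 0.5],
    [0.1, 0.3, 0.4, 0.6],
]

other = SOURCES.index("other")
bag_output = combine(estimates, WEIGHTS)
print(bag_output[other] == estimates[other][other])
```

Under that assumption the other three sub-models contribute nothing to the other stem, which is why skipping them saves compute without changing the result.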
Deploy on Hugging Face Inference Endpoints
Click Deploy → Inference Endpoints above, pick a GPU instance, and HF
will spin up a container running handler.py.
| Hardware | Latency for 3-min song |
|---|---|
| NVIDIA L4 | ~3 s |
| NVIDIA T4 small | ~7 s |
| CPU x4 (basic) | ~48 s |
(Roughly 2.6× faster than the full-bag latency, since we run only this specialist sub-model. Cloud GPU numbers extrapolated from M4 Pro measurements.)
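# Stream the JSON body over stdin: a base64-encoded song is usually too large for one command-line argument,
# and tr strips the line wraps that some base64 implementations insert.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  --data-binary @-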
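For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not the endpoint's official client: it assumes the endpoint accepts the `{"inputs": <base64 audio>}` body used above and answers with JSON containing a base64-encoded `other` field, as in the Quick start. The function names `build_request`, `decode_stem`, and `separate` are hypothetical helpers.

```python
import base64
import json
import urllib.request


def build_request(endpoint_url: str, token: str, audio_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) the JSON POST request for the endpoint."""
    body = json.dumps({"inputs": base64.b64encode(audio_bytes).decode("ascii")})
    return urllib.request.Request(
        endpoint_url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def decode_stem(response_body: bytes, stem: str = "other") -> bytes:
    """Return the decoded audio bytes for one stem from a JSON response body."""
    payload = json.loads(response_body)
    if stem not in payload:
        raise KeyError(f"response has no {stem!r} field; got keys {sorted(payload)}")
    return base64.b64decode(payload[stem])


def separate(endpoint_url: str, token: str, path: str, timeout: float = 300.0) -> bytes:
    """Send one audio file to the endpoint and return the decoded stem bytes."""
    with open(path, "rb") as f:
        request = build_request(endpoint_url, token, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_stem(response.read())
```

Calling `separate(url, token, "your-song.mp3")` and writing the returned bytes to `out_other.wav` mirrors the Quick start. Building the request separately from sending it keeps the payload logic testable without network access.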
Related models from StemSplit
| Repo | Stem | When to use |
|---|---|---|
| htdemucs-ft-pytorch | all 4 | When you need vocals + drums + bass + other in one request |
| htdemucs-ft-vocals-pytorch | vocals | Best vocal SDR in our benchmark (9.19 dB): karaoke, acapella |
| htdemucs-ft-drums-pytorch | drums | Drum extraction, beat transcription, sample-pack creation |
| htdemucs-ft-bass-pytorch | bass | Bassline transcription, mix rebalancing |
| htdemucs-ft-other-pytorch | other / instrumental | Karaoke instrumentals, sample-flipping, music-bed extraction |
Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.
License & attribution
This repo is MIT-licensed, matching the original HT-Demucs.
Original authors (please cite if you use this model in research):
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
- Original model: facebookresearch/demucs
- Packaging by StemSplit
- Search keywords: instrumental extractor, karaoke maker, music minus vocals, AI instrumental separator