HT-Demucs FT: Instrumental / Other Specialist (PyTorch)

Melodic / instrumental specialist from HT-Demucs FT: everything that isn't vocals, drums, or bass.

This is sub-model 2 (zero-based index, bag.models[2]) of the 4-bag htdemucs_ft ensemble by Défossez et al. (Meta AI), extracted as a standalone ~160 MB model. It produces the other stem with the same quality as the full ensemble at roughly 1/4 the compute cost. Its median SDR of 6.34 dB on MUSDB18-HQ ranks 2nd of all models in our 2026 benchmark, behind mdx_extra_q at 7.67 dB.

Want all 4 stems in one request? Use the full ensemble: StemSplitio/htdemucs-ft-pytorch

Want a hosted REST API with credits and a dashboard? Use the StemSplit API.


Why this model

Property | This model | Full htdemucs_ft bag
Disk size | ~160 MB | ~640 MB
Latency per 3-min song (M4 Pro MPS) | ~22 s (RTF 0.12) | ~47 s (RTF 0.26)
Instrumental / Other SDR on MUSDB18-HQ | 6.34 dB | 6.34 dB (identical: the bag's other output is this sub-model's output)
Other stems returned | None (focused) | All 4

If you only need the other stem in production, this is strictly faster and smaller than the full ensemble with identical other quality: ~2.6× faster wall time in our smoke tests on M4 Pro MPS.
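
The RTF figures in the table are consistent with processing time divided by audio duration (22 s / 180 s ≈ 0.12). A minimal sketch of that arithmetic follows, plus a timing helper for your own hardware. The function names are illustrative, and `separate` is a hypothetical zero-argument callable wrapping whatever separation call you use.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(separate: Callable[[], object], audio_seconds: float) -> float:
    """Time one call to `separate` (hypothetical callable) and return its RTF."""
    start = time.perf_counter()
    separate()
    return real_time_factor(time.perf_counter() - start, audio_seconds)


# Table values for a 3-minute (180 s) song
print(round(real_time_factor(22, 180), 2))  # 0.12
print(round(real_time_factor(47, 180), 2))  # 0.26
```

A single timed run includes model loading and warm-up, so averaging a few runs after the first gives a steadier figure.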


Common use cases

  • Karaoke / instrumental tracks: extract the music-minus-vocals layer for karaoke mixes (use it with our htdemucs-ft-vocals-pytorch model to round-trip)
  • Sample-flipping: isolate guitar/keys/synth lines for chopping and remixing
  • Cover-song production: remove vocals and rebalance the instrumental bed
  • Music-bed for video: strip vocals from licensed tracks for under-spoken-word use (check your sync rights first)

Quick start (Python)

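import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient

# Send the audio file as a base64 string
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = InferenceClient(model="StemSplitio/htdemucs-ft-other-pytorch")
# InferenceClient.post returns raw bytes, so parse the JSON body before indexing.
# post() is deprecated in recent huggingface_hub releases; pin an older version if it is missing.
result = json.loads(client.post(json={"inputs": audio_b64}))

# The "other" field holds the separated stem as a base64-encoded audio file
wav, sr = sf.read(io.BytesIO(base64.b64decode(result["other"])))
sf.write("out_other.wav", wav, sr)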

Or run locally without Hugging Face at all:

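import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

bag = get_model("htdemucs_ft")  # downloads the 4-model bag on first use
model = bag.models[2].eval()  # the other specialist

# soundfile returns (frames, channels); Demucs expects (channels, frames).
# Reading MP3 needs a libsndfile build with MP3 support (1.1.0 or later).
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
# Resample and remix to the model's rate and channel count, then add a batch dimension
wav = convert_audio(wav, sr, bag.samplerate, bag.audio_channels).unsqueeze(0)

# Prefer CUDA, then Apple MPS, then CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]  # (sources, channels, frames)

# bag.sources == ["drums", "bass", "other", "vocals"]; pick the other row
other = stems[bag.sources.index("other")].cpu()
sf.write("out_other.wav", other.T.numpy(), bag.samplerate)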

Deploy on Hugging Face Inference Endpoints

Click Deploy → Inference Endpoints above, pick a GPU instance, and HF will spin up a container running handler.py.

Hardware | Latency for 3-min song
NVIDIA L4 | ~3 s
NVIDIA T4 small | ~7 s
CPU x4 (basic) | ~48 s

(Roughly 2.6× faster than the full-bag latency, since we run only this specialist sub-model. Cloud GPU numbers are extrapolated from M4 Pro measurements.)

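# Stream the JSON body on stdin so a multi-megabyte base64 string never hits the shell's argument-length limit.
# tr strips the line wraps that GNU base64 inserts by default.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  --data-binary @-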
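
The same request can be sketched in Python with only the standard library. This assumes the endpoint accepts the `{"inputs": <base64 audio>}` body shown in the curl call and returns the same JSON shape as the Quick start, with the stem as a base64-encoded audio file under `"other"`. The helper names, the `.wav` output name, and the timeout value are illustrative assumptions, not part of handler.py.

```python
import base64
import json
import urllib.request


def build_request(endpoint_url: str, token: str, audio_bytes: bytes) -> urllib.request.Request:
    """Build the POST request: base64 audio under "inputs", bearer-token auth."""
    payload = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_stem(response_body: bytes, stem: str = "other") -> bytes:
    """Extract one stem from the JSON response (assumed shape: {"other": "<base64>"})."""
    result = json.loads(response_body)
    return base64.b64decode(result[stem])


def separate_other(endpoint_url: str, token: str, in_path: str, out_path: str,
                   timeout: float = 300.0) -> None:
    """Send a local audio file to the endpoint and save the returned stem."""
    with open(in_path, "rb") as f:
        request = build_request(endpoint_url, token, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = decode_stem(response.read())
    with open(out_path, "wb") as f:
        f.write(audio)
```

A call would look like `separate_other(url, os.environ["HF_TOKEN"], "your-song.mp3", "out_other.wav")`, with `url` set to your own endpoint address. Keeping request building and response decoding in separate functions lets you check both without a live endpoint.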


Related models from StemSplit

Repo | Stem | When to use
htdemucs-ft-pytorch | all 4 | When you need vocals + drums + bass + other in one request
htdemucs-ft-vocals-pytorch | vocals | Best vocal SDR in our benchmark (9.19 dB): karaoke, acapella
htdemucs-ft-drums-pytorch | drums | Drum extraction, beat transcription, sample-pack creation
htdemucs-ft-bass-pytorch | bass | Bassline transcription, mix rebalancing
htdemucs-ft-other-pytorch | other / instrumental | Karaoke instrumentals, sample-flipping, music-bed extraction

Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.


License & attribution

This repo is MIT-licensed, matching the original HT-Demucs.

Original authors (please cite if you use this model in research):

@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
  • Original model: facebookresearch/demucs
  • Packaging by StemSplit
  • Search keywords: instrumental extractor, karaoke maker, music minus vocals, AI instrumental separator