MOSS-Music-8B-Thinking · MLX 8-bit
An 8-bit MLX quantization of OpenMOSS-Team/MOSS-Music-8B-Thinking for music understanding (captioning, key / tempo / chord, structure, lyrics ASR, long-form QA) that runs locally on Apple Silicon Macs.
Community conversion, not an official release. All model credit goes to the OpenMOSS Team.
Why this exists
On the stock PyTorch + MPS path several audio-encoder ops fall back to CPU, and local generation is effectively unusable (under 0.3 tok/s, often hanging). This MLX build runs properly on a Mac:
| | PyTorch / MPS (bf16) | This model (MLX 8-bit) |
|---|---|---|
| Size on disk | 18 GB | ~10 GB |
| Load time | ~17 s | ~1.5 s |
| One 75 s song | stalls (>13 min) | ~34 s |
| Throughput | <0.3 tok/s | ~23 tok/s |
(Indicative single-run numbers on an M4, 24 GB.)
Usage
MOSS-Music is a custom multimodal (audio + text) model, so it does not load with
mlx_lm / mlx_vlm directly. Use the moss_music_mlx backend:
- Backend code: https://github.com/dthinkr/MOSS-Music/tree/feat/mlx-backend/mlx
- Upstream PR: https://github.com/OpenMOSS/MOSS-Music/pull/3
from huggingface_hub import snapshot_download
from moss_music_mlx import load_pretrained, generate
from src.processing_moss_music import MossMusicProcessor
path = snapshot_download("mlx-community/MOSS-Music-8B-Thinking-8bit")
model = load_pretrained(path)
proc = MossMusicProcessor.from_pretrained(path, trust_remote_code=True, enable_time_marker=True)
print(generate(model, proc,
               "Analyze this track: genre, key, BPM, structure.",
               audio_path="song.mp3"))
Or from the command line:
python -m moss_music_mlx.generate --model <downloaded_path> --audio song.mp3 \
    --prompt "Describe this music."
See the backend mlx/README.md for full setup and the parity tests.
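To run the command-line entry point over a folder of tracks, one option is to build one argument list per file and hand each to `subprocess.run`. The sketch below is illustrative only: the helper names `build_command` and `commands_for_folder` and the `AUDIO_SUFFIXES` set are hypothetical, and only `.mp3` input appears in this card, so other formats are an assumption. It uses just the `--model`, `--audio` and `--prompt` flags shown above.

```python
import sys
from pathlib import Path

# Assumption: which file extensions to pick up. Only .mp3 is shown in this card.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac"}


def build_command(model_path, audio_path, prompt):
    """Return the argv list for one `python -m moss_music_mlx.generate` call."""
    return [
        sys.executable, "-m", "moss_music_mlx.generate",
        "--model", str(model_path),
        "--audio", str(audio_path),
        "--prompt", prompt,
    ]


def commands_for_folder(model_path, folder, prompt):
    """Build one command per audio file in `folder`, in sorted order."""
    files = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in AUDIO_SUFFIXES
    )
    return [build_command(model_path, p, prompt) for p in files]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Every invocation reloads the model, so for large batches calling `generate` from a single Python process, as in the snippet above, is likely the better choice.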
How it was converted
- 8-bit, group size 64. The audio encoder is kept at bf16 to preserve audio fidelity; quantization is applied to the Qwen3 layers, token embeddings and lm_head.
- Converted with mlx==0.31.2, mlx-lm==0.29.1.
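For readers unfamiliar with "8-bit, group size 64", the idea is that every run of 64 weights shares one scale and one offset, and each weight is stored as an 8-bit integer code. The pure-Python sketch below illustrates that group-wise affine scheme; it is not the MLX kernel, whose rounding and storage layout may differ. The function names are hypothetical, and the 16-bit scale and bias widths in `bits_per_weight` are an assumption.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    levels = (1 << bits) - 1                # 255 codes above zero for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0       # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                 # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * q + bias for q in codes]


def quantize(weights, group_size=64, bits=8):
    """Quantize a flat list of weights one group at a time."""
    if len(weights) % group_size:
        raise ValueError("length must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def bits_per_weight(bits=8, group_size=64, scale_bits=16, bias_bits=16):
    """Effective storage cost, assuming one scale and one bias per group."""
    return bits + (scale_bits + bias_bits) / group_size
```

The reconstruction error for each weight is at most half of its group's scale, which is why smaller groups track the original values more closely at the cost of more per-group metadata.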
Accuracy
| Comparison | Result |
|---|---|
| 8-bit vs fp32 reference — prefill next token | argmax identical, logit cosine 0.99999 |
| 8-bit vs bf16 — prefill, 5 mixed-genre clips | argmax 5 / 5, mean cosine 0.99998 |
These comparisons use greedy (argmax) decoding. Long sampled generations may still diverge after a near-tie token, as expected for 8-bit quantization.
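The two metrics in the table, argmax agreement and logit cosine similarity, can be computed from any pair of next-token logit vectors. The following is a minimal sketch with hypothetical helper names; it is not the backend's parity test, and it assumes the logits are already available as plain Python sequences of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest entry (the token greedy decoding would pick)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, quantized):
    """Return (same_top_token, cosine) for a reference/quantized logit pair."""
    same_top = argmax(reference) == argmax(quantized)
    return same_top, cosine_similarity(reference, quantized)
```

A cosine close to 1 with matching argmax means the quantized model would pick the same next token under greedy decoding, while two nearly tied logits can still swap order under small perturbations.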
License & credit
Apache-2.0, inherited from the base model. This repository provides only the MLX-quantized weights. Please cite the original authors:
@misc{mossmusic2026,
title = {MOSS-Music Technical Report},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/OpenMOSS/MOSS-Music}}
}