MOSS-Music-8B-Thinking · MLX 8-bit

An 8-bit MLX quantization of OpenMOSS-Team/MOSS-Music-8B-Thinking for music understanding (captioning, key / tempo / chord, structure, lyrics ASR, long-form QA) that runs locally on Apple Silicon Macs.

Community conversion, not an official release. All model credit goes to the OpenMOSS Team.

Other sizes: 6-bit · 4-bit

Why this exists

On the stock PyTorch + MPS path, several audio-encoder ops fall back to CPU, and local generation is effectively unusable (under 0.3 tok/s, often hanging). This MLX build runs at usable speed on a Mac:

Metric	PyTorch / MPS (bf16)	This model (MLX 8-bit)
Size on disk	18 GB	~10 GB
Load time	~17 s	~1.5 s
One 75 s song	stalls (>13 min)	~34 s
Throughput	<0.3 tok/s	~23 tok/s

(Indicative single-run numbers on an M4, 24 GB.)
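To reproduce numbers like these on your own machine, a small wall-clock timer is enough. The sketch below is only an illustration: `time_generation`, `generate_fn` and `count_tokens` are hypothetical names, the stand-in generator is not the real model, and the source does not say how the figures in the table were measured (for example, whether audio encoding and prefill are included in the throughput).

```python
import time
from typing import Callable, Tuple


def time_generation(generate_fn: Callable[[], str],
                    count_tokens: Callable[[str], int]) -> Tuple[str, float, float]:
    """Run generate_fn once and return (text, seconds, tokens_per_second)."""
    start = time.perf_counter()
    text = generate_fn()
    seconds = time.perf_counter() - start
    n_tokens = count_tokens(text)
    # Guard against a zero-length interval on very fast stand-in callables.
    tok_per_s = n_tokens / seconds if seconds > 0 else float("inf")
    return text, seconds, tok_per_s


if __name__ == "__main__":
    # Stand-in for a real model call; whitespace splitting is only a rough
    # proxy for the tokenizer's token count.
    fake_generate = lambda: "a slow blues in E minor at 72 BPM"
    text, seconds, rate = time_generation(fake_generate, lambda s: len(s.split()))
    print(f"{len(text.split())} tokens in {seconds:.6f} s ({rate:.1f} tok/s)")
```

With the real backend, `generate_fn` would wrap the model call and `count_tokens` would use the model's tokenizer. A single run is noisy, so averaging a few runs after a warm-up gives a steadier figure.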

Usage

MOSS-Music is a custom multimodal (audio + text) model, so it does not load with mlx_lm / mlx_vlm directly. Use the moss_music_mlx backend:

Backend code: https://github.com/dthinkr/MOSS-Music/tree/feat/mlx-backend/mlx
Upstream PR: https://github.com/OpenMOSS/MOSS-Music/pull/3

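from huggingface_hub import snapshot_download
# moss_music_mlx and src.processing_moss_music come from the backend repository
# linked above; they must be importable (see mlx/README.md for setup).
from moss_music_mlx import load_pretrained, generate
from src.processing_moss_music import MossMusicProcessor

# Download the quantized weights (cached locally after the first call).
path = snapshot_download("mlx-community/MOSS-Music-8B-Thinking-8bit")
model = load_pretrained(path)
proc = MossMusicProcessor.from_pretrained(path, trust_remote_code=True, enable_time_marker=True)

# Ask a question about a local audio file and print the model's answer.
print(generate(model, proc,
               "Analyze this track: genre, key, BPM, structure.",
               audio_path="song.mp3"))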

Or from the command line:

python -m moss_music_mlx.generate --model <downloaded_path> --audio song.mp3 \
  --prompt "Describe this music."

See mlx/README.md in the backend repository for full setup and the parity tests.

How it was converted

8-bit, group size 64. The audio encoder is kept at bf16 to preserve audio fidelity; quantization is applied to the Qwen3 layers, token embeddings and lm_head.
Converted with mlx==0.31.2, mlx-lm==0.29.1.
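For readers unfamiliar with what "8-bit, group size 64" means, the following is a minimal pure-Python sketch of group-wise affine quantization: each run of 64 weights shares one scale and one bias, and every weight is stored as an 8-bit code. This is an illustration of the general technique, not the MLX implementation (which packs codes and runs on the GPU). All function names are hypothetical, and the 16-bit width assumed for each scale and bias in `bits_per_weight` is an assumption, not something stated in the source.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, scale, bias)


def quantize_group(values: List[float], bits: int = 8) -> Group:
    """Affine-quantize one group so that value ~= code * scale + bias."""
    levels = (1 << bits) - 1                 # 255 distinct steps for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, bias: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]


def quantize_row(row: List[float], group_size: int = 64, bits: int = 8) -> List[Group]:
    """Split a weight row into fixed-size groups and quantize each one."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def bits_per_weight(bits: int = 8, group_size: int = 64, param_bits: int = 16) -> float:
    """Effective storage cost: the code plus one scale and one bias per group.

    param_bits=16 is an assumption about how scales and biases are stored.
    """
    return bits + 2 * param_bits / group_size


if __name__ == "__main__":
    row = [math.sin(0.1 * i) for i in range(128)]    # toy "weights"
    groups = quantize_row(row)
    codes, scale, bias = groups[0]
    restored = dequantize_group(codes, scale, bias)
    worst = max(abs(a - b) for a, b in zip(row[:64], restored))
    print(f"groups={len(groups)} worst_error={worst:.6f} half_step={scale / 2:.6f}")
    print(f"effective bits per weight: {bits_per_weight()}")
```

The round-to-nearest error of each weight is bounded by half a quantization step, which is why smaller groups (a tighter min/max range per group) trade a little extra storage for lower error.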

Accuracy

Comparison	Result
8-bit vs fp32 reference — prefill next token	argmax identical, logit cosine 0.99999
8-bit vs bf16 — prefill, 5 mixed-genre clips	argmax 5 / 5, mean cosine 0.99998

The comparisons above use greedy decoding; long sampled generations may still diverge after a near-tie token, as expected for 8-bit quantization.
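The two metrics in the table, argmax agreement and logit cosine similarity, can be computed from a pair of logit vectors as in the sketch below. It is a plain-Python illustration with hypothetical function names and made-up toy logits, not the backend's parity test. The second example shows the near-tie case mentioned above: the cosine stays very close to 1 while the top token flips.

```python
import math
from typing import Sequence, Tuple


def argmax(xs: Sequence[float]) -> int:
    """Index of the largest value (first one wins on exact ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_logits(reference: Sequence[float],
                   quantized: Sequence[float]) -> Tuple[bool, float]:
    """Return (same_top_token, cosine) for one next-token logit vector."""
    same_top = argmax(reference) == argmax(quantized)
    return same_top, cosine_similarity(reference, quantized)


if __name__ == "__main__":
    # Toy logits, invented for illustration only.
    clear_winner = compare_logits([2.0, -1.0, 0.5, 4.0], [2.01, -0.99, 0.48, 3.98])
    near_tie = compare_logits([1.00, 1.01], [1.01, 1.00])
    print("clear winner:", clear_winner)
    print("near tie:    ", near_tie)
```

This is why a high cosine alone does not guarantee identical text: when two candidates are nearly tied, a tiny perturbation can change which one is picked, and every later token then depends on that choice.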

License & credit

Apache-2.0, inherited from the base model. This repository provides only the MLX-quantized weights. Please cite the original authors:

@misc{mossmusic2026,
  title  = {MOSS-Music Technical Report},
  author = {OpenMOSS Team},
  year   = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-Music}}
}
