MOSS-Music-8B-Thinking · MLX 8-bit

An 8-bit MLX quantization of OpenMOSS-Team/MOSS-Music-8B-Thinking for music understanding (captioning, key / tempo / chord, structure, lyrics ASR, long-form QA) that runs locally on Apple Silicon Macs.

Community conversion, not an official release. All model credit goes to the OpenMOSS Team.

Other sizes: 6-bit · 4-bit

Why this exists

On the stock PyTorch + MPS path, several audio-encoder ops fall back to CPU, and local generation is effectively unusable (under 0.3 tok/s, often hanging). This MLX build runs properly on a Mac:

Metric           PyTorch / MPS (bf16)    This model (MLX 8-bit)
Size on disk     18 GB                   ~10 GB
Load time        ~17 s                   ~1.5 s
One 75 s song    stalls (>13 min)        ~34 s
Throughput       <0.3 tok/s              ~23 tok/s

(Indicative single-run numbers on an M4, 24 GB.)

Usage

MOSS-Music is a custom multimodal (audio + text) model, so it does not load with mlx_lm / mlx_vlm directly. Use the moss_music_mlx backend:

from huggingface_hub import snapshot_download
from moss_music_mlx import load_pretrained, generate
from src.processing_moss_music import MossMusicProcessor

path = snapshot_download("mlx-community/MOSS-Music-8B-Thinking-8bit")
model = load_pretrained(path)
proc = MossMusicProcessor.from_pretrained(path, trust_remote_code=True, enable_time_marker=True)

print(generate(model, proc,
               "Analyze this track: genre, key, BPM, structure.",
               audio_path="song.mp3"))

Or from the command line:

python -m moss_music_mlx.generate --model <downloaded_path> --audio song.mp3 \
  --prompt "Describe this music."

See the backend mlx/README.md for full setup and the parity tests.
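
To run the same prompt over a folder of tracks, the Python call above can be wrapped in a small loop. The following is a minimal sketch, not part of moss_music_mlx: `analyze_folder`, the `run` callable and the `AUDIO_EXTS` set are hypothetical names introduced here, and which audio formats the backend actually accepts is an assumption to verify.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed set of extensions for illustration; adjust to what the backend accepts.
AUDIO_EXTS = {".mp3", ".wav", ".flac"}


def analyze_folder(folder: str, run: Callable[[str], str],
                   exts: Iterable[str] = AUDIO_EXTS) -> Dict[str, str]:
    """Call run(audio_path) for every matching file in folder, sorted by name."""
    wanted = {e.lower() for e in exts}
    results = {}
    for p in sorted(Path(folder).iterdir()):
        # Skip subdirectories and files with other extensions.
        if p.is_file() and p.suffix.lower() in wanted:
            results[p.name] = run(str(p))
    return results
```

With `model` and `proc` loaded as in the example above, `run` could be something like `lambda path: generate(model, proc, "Describe this music.", audio_path=path)`, assuming `generate` returns the response text as the `print` call suggests.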

How it was converted

  • 8-bit, group size 64. The audio encoder is kept at bf16 to preserve audio fidelity; quantization is applied to the Qwen3 layers, token embeddings and lm_head.
  • Converted with mlx==0.31.2, mlx-lm==0.29.1.
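
For readers unfamiliar with "8-bit, group size 64", the idea can be sketched in NumPy. This is an illustration of group-wise affine quantization in general, not MLX's actual kernel or storage layout; the function names are hypothetical, and the bits-per-weight estimate assumes one 16-bit scale and one 16-bit bias stored per group.

```python
import numpy as np


def quantize_groups(w, bits=8, group_size=64):
    """Affine-quantize a 1-D float array group by group (bits <= 8 here).

    Returns (q, scale, bias) such that w is approximately q * scale + bias.
    """
    g = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    # A constant group has zero range; use 1.0 to avoid dividing by zero.
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round((g - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def dequantize_groups(q, scale, bias):
    """Reconstruct approximate float weights from the quantized groups."""
    return (q.astype(np.float32) * scale + bias).reshape(-1)


def bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Effective storage per weight, counting one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size
```

Under those assumptions each quantized weight costs about 8.5 bits rather than 8, and the rounding error of any single weight is bounded by half of its group's scale.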

Accuracy

Comparison                                       Result
8-bit vs fp32 reference (prefill, next token)    argmax identical, logit cosine 0.99999
8-bit vs bf16 (prefill, 5 mixed-genre clips)     argmax 5 / 5, mean cosine 0.99998

These comparisons use greedy decoding. Long sampled generations may still diverge after a near-tie token, as expected for 8-bit quantization.
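
The two metrics reported above can be computed from a pair of next-token logit vectors as follows. This is a minimal sketch of the metrics themselves, assuming the logits are already available as arrays; `compare_logits` is a hypothetical helper, and the backend's own parity tests may be organized differently.

```python
import numpy as np


def compare_logits(ref, test):
    """Return (same_argmax, cosine) for two logit vectors of equal length."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    # Greedy decoding picks the argmax, so agreement here means the same token.
    same_argmax = int(ref.argmax()) == int(test.argmax())
    cosine = float(ref @ test / (np.linalg.norm(ref) * np.linalg.norm(test)))
    return same_argmax, cosine
```

A cosine close to 1 with matching argmax means the quantized model picks the same greedy token; when the top two logits are nearly tied, a small perturbation can still flip the choice, which is the divergence mentioned above.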

License & credit

Apache-2.0, inherited from the base model. This repository provides only the MLX-quantized weights. Please cite the original authors:

@misc{mossmusic2026,
  title  = {MOSS-Music Technical Report},
  author = {OpenMOSS Team},
  year   = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-Music}}
}