MOSS-Audio-8B-Thinking-MLX-hybrid

Release Status

Current community release: Yes
Variant: MLX-hybrid (INT4 LLM + BF16 audio path), no PyTorch at inference
Canonical pair: Released alongside MOSS-Audio-4B-Thinking-MLX-4bit

This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-8B-Thinking and released as the MLX-hybrid variant.

LLM: Qwen3-8B INT4 (group size 64)
Audio path: encoder + adapter + DeepStack in BF16 on MLX
Runtime: no PyTorch at inference time
Target: desktop/workstation (not mobile-viable)

Layout

mlx_llm/         Qwen3-8B INT4 weights + tokenizer      (4.3 GB)
mlx_audio/       Audio encoder + adapter + DeepStack    (1.6 GB)
                 mergers, all BF16
scripts/         Pure-MLX bridge source (same as 4B)
inference.py     Standalone example

Users who want a smaller bundle can INT4-quantize the audio path themselves. We did not ship an INT4 audio path for the 8B by default because the 8B model's main advantage is quality, not size, and at 1.6 GB the BF16 audio path is not the binding constraint.
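
As a rough guide to what quantizing the audio path would save, the sketch below estimates the INT4 footprint from the BF16 size. It assumes group-wise affine quantization with one 16-bit scale and one 16-bit bias per group of 64 weights (the same group size as the LLM), and it optimistically assumes every audio-path parameter is quantizable, which is unlikely to hold in practice. The function name is illustrative and not part of this bundle.

```python
def estimate_quantized_gb(bf16_gb: float, bits: int = 4, group_size: int = 64) -> float:
    """Estimate size after group-wise affine quantization.

    Assumptions (not taken from the bundle): each group stores a 16-bit
    scale and a 16-bit bias, and every parameter is quantized, so the
    result is an optimistic lower bound.
    """
    bits_per_weight = bits + 2 * 16 / group_size  # 4 + 0.5 = 4.5 for INT4, group 64
    return bf16_gb * bits_per_weight / 16  # BF16 stores 16 bits per weight


audio_int4_gb = estimate_quantized_gb(1.6)   # audio path size from the layout
bundle_int4_gb = 4.3 + audio_int4_gb         # LLM weights are already INT4
print(f"audio path: 1.6 GB -> ~{audio_int4_gb:.2f} GB")
print(f"bundle:     5.9 GB -> ~{bundle_int4_gb:.2f} GB")
```

Under these assumptions the saving is a little over 1 GB on a 5.9 GB bundle, which is consistent with the audio path not being the binding constraint.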

Measured on Apple M3 Ultra

Metric	This bundle	Full BF16
Decode speed	110 t/s	25 t/s
Peak memory, decode steady state	5.0 GB	33 GB
Peak memory, transient prefill	8.1 GB	33 GB
Disk footprint	5.9 GB	18.1 GB
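
The relative gains implied by these measurements can be worked out directly. The snippet below is only arithmetic on the figures in the table; the dictionary keys are illustrative names.

```python
# Figures copied from the table above (Apple M3 Ultra).
bundle = {"decode_tps": 110, "steady_gb": 5.0, "prefill_gb": 8.1, "disk_gb": 5.9}
full_bf16 = {"decode_tps": 25, "steady_gb": 33, "prefill_gb": 33, "disk_gb": 18.1}

# Higher is better for throughput, so divide bundle by BF16.
speedup = bundle["decode_tps"] / full_bf16["decode_tps"]

# Lower is better for memory and disk, so divide BF16 by bundle.
reduction = {
    key: full_bf16[key] / bundle[key]
    for key in ("steady_gb", "prefill_gb", "disk_gb")
}

print(f"decode speedup: {speedup:.1f}x")
for key, factor in reduction.items():
    print(f"{key}: {factor:.1f}x smaller")
```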

Audio quality varies by domain (speech vs non-speech); validate on your target data.

Usage

Same as the 4B bundle; the inference.py script auto-detects the model size:

pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav

No PyTorch required at runtime (mel path ported to MLX).
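
To process several clips, one option is a small wrapper that calls the script once per file. The sketch below uses only the documented --audio flag. The helper names are hypothetical, and it assumes that inference.py sits in the working directory and writes its result to stdout, which this card does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path, script="inference.py"):
    """Build the documented invocation: python inference.py --audio <clip>."""
    return [sys.executable, script, "--audio", str(audio_path)]


def run_directory(clip_dir, pattern="*.wav"):
    """Run the script once per matching clip and return {clip name: stdout}.

    Assumption: inference.py prints its result to stdout.
    """
    outputs = {}
    for clip in sorted(Path(clip_dir).glob(pattern)):
        result = subprocess.run(
            build_command(clip), capture_output=True, text=True, check=True
        )
        outputs[clip.name] = result.stdout
    return outputs


# Example (requires the bundle and a directory of clips):
#   for name, text in run_directory("clips").items():
#       print(f"== {name}\n{text}")
```

Each call starts a new process and so reloads the weights; for large batches it may be worth importing the bridge code from scripts/ instead.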

Evaluation (BFCL v3, community-run)

Text-only tool-calling evaluation (not an audio-caption quality metric), run on a 600-sample subset (simple/multiple/parallel = 200 each), greedy decoding (repetition_penalty=0), measured 2026-05.

Category	This bundle	8B BF16	Qwen3-8B base
simple (200)	91.0%	93.5%	94.5%
multiple (200)	92.5%	93.0%	94.5%
parallel (200)	54.5%	78.5%	90.5%
3-cat avg	79.3%	88.3%	93.2%
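
The averages and per-category gaps can be recomputed from the rows above. This is a plain check on the published numbers, not part of the evaluation harness. Because each category has 200 samples, the unweighted mean equals the pooled accuracy.

```python
# Accuracy (%) per category, copied from the table above.
scores = {
    "This bundle": {"simple": 91.0, "multiple": 92.5, "parallel": 54.5},
    "8B BF16": {"simple": 93.5, "multiple": 93.0, "parallel": 78.5},
    "Qwen3-8B base": {"simple": 94.5, "multiple": 94.5, "parallel": 90.5},
}


def macro_average(per_category):
    """Unweighted mean over categories."""
    return sum(per_category.values()) / len(per_category)


averages = {name: round(macro_average(cats), 1) for name, cats in scores.items()}

# Gap between the BF16 model and this bundle, in percentage points.
gaps_vs_bf16 = {
    cat: round(scores["8B BF16"][cat] - scores["This bundle"][cat], 1)
    for cat in scores["This bundle"]
}

print(averages)
print(gaps_vs_bf16)
```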

Single-call categories stay close to BF16: simple is 2.5 pp lower and multiple is 0.5 pp lower. Parallel degrades by 24 pp under INT4 (78.5% to 54.5%), a genuine decode regression. For speech-dominant use cases (dialog, interviews, transcription), tool-calling quality is strong. Parser note: this model can emit bare JSON tool calls; the evaluation accepted schema-valid {name, arguments} JSON as tool-call output.
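
The parser note can be illustrated with a minimal acceptance check for bare JSON tool calls. This is a sketch, not the parser used in the evaluation: the function name is hypothetical, and treating a top-level JSON list as several parallel calls is an assumption.

```python
import json


def parse_bare_tool_calls(text):
    """Return a list of {name, arguments} dicts, or [] if the text is not
    a schema-valid bare JSON tool call.

    Assumption: a top-level JSON list represents several (parallel) calls.
    """
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return []

    candidates = payload if isinstance(payload, list) else [payload]
    calls = []
    for item in candidates:
        valid = (
            isinstance(item, dict)
            and isinstance(item.get("name"), str)
            and isinstance(item.get("arguments"), dict)
        )
        if not valid:
            return []  # reject the whole output if any entry is malformed
        calls.append({"name": item["name"], "arguments": item["arguments"]})
    return calls


single = parse_bare_tool_calls('{"name": "get_weather", "arguments": {"city": "Paris"}}')
rejected = parse_bare_tool_calls("The weather in Paris is sunny.")
```

Rejecting the whole output when one entry is malformed is a strict choice; a more lenient scorer could keep the valid entries instead.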

Limitations

Non-speech instability (upstream-intrinsic). The 8B model is unstable on ambient, music, and other non-speech clips: 2/9 hard truncations plus 2 further degenerate outputs (digit loops, hallucinated JSON "metadata") on our 7-clip × 3-rep EN-scope benchmark. Raising --repetition-penalty to 1.05 makes this worse, not better (10/21 truncations vs 6/21 at RP=1.02).

This failure mode matches the behavior of the PyTorch reference (10/21 truncations reported in the v2 benchmark), so it is not an MLX-port regression. We recommend this bundle for speech-dominant use cases (dialog transcription, speaker attribution, spoken-event analysis) and advise against it for music genre tagging or ambient-scene captioning; use the 4B bundle for those.
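
When validating on your own non-speech clips, a simple repetition heuristic can flag loop-style outputs for manual review. The sketch below is an illustrative check, not the criterion used in the benchmark above; the function names, the n-gram size, and the threshold are all assumptions to tune on your data. It does not detect hard truncations.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=3):
    """Share of word n-grams taken up by the single most frequent n-gram.

    Returns 0.0 when no n-gram occurs more than once.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    top_count = Counter(ngrams).most_common(1)[0][1]
    if top_count < 2:
        return 0.0
    return top_count / len(ngrams)


def looks_degenerate(text, n=3, threshold=0.3):
    """Flag outputs dominated by one repeated n-gram (hypothetical threshold)."""
    return repeated_ngram_fraction(text, n) >= threshold


loop_flagged = looks_degenerate("1 2 3 1 2 3 1 2 3 1 2 3")
normal_flagged = looks_degenerate("A man speaks while a dog barks in the background")
```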

Contributors

Rumilabs Inc: quantization, MLX conversion, hybrid runtime design, benchmarking, and release packaging. Rumilabs describes itself as building the richest content knowledge base in the world to empower interactive media.

License

Apache-2.0 (inherited from base model).

Citation

Base model: MOSS-Audio.
