MOSS-Audio-8B-Thinking-MLX-hybrid

Release Status

  • Current community release: Yes
  • Variant: MLX-hybrid (INT4 LLM + BF16 audio path), no PyTorch at inference
  • Canonical pair: Released alongside MOSS-Audio-4B-Thinking-MLX-4bit

This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-8B-Thinking and released as the MLX-hybrid variant.

  • LLM: Qwen3-8B INT4 (group size 64)
  • Audio path: encoder + adapter + DeepStack in BF16 on MLX
  • Runtime: no PyTorch at inference time
  • Target: desktop/workstation (not mobile-viable)

Layout

mlx_llm/         Qwen3-8B INT4 weights + tokenizer      (4.3 GB)
mlx_audio/       Audio encoder + adapter + DeepStack    (1.6 GB)
                 mergers, all BF16
scripts/         Pure-MLX bridge source (same as 4B)
inference.py     Standalone example
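
A quick sanity check after download can confirm that the bundle matches the layout above. This is a minimal sketch: the entry names come from the listing, while the helper names (`missing_entries`, `dir_size_gb`) are illustrative and not part of the shipped scripts. It assumes decimal gigabytes (1 GB = 1e9 bytes), which may not be the unit used for the sizes quoted above.

```python
from pathlib import Path

# Top-level entries taken from the layout listing above.
EXPECTED_ENTRIES = ("mlx_llm", "mlx_audio", "scripts", "inference.py")


def missing_entries(bundle_dir):
    """Return the expected top-level entries that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def dir_size_gb(path):
    """Sum file sizes under path in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    total = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total / 1e9
```

An empty list from `missing_entries("path/to/bundle")` means all four entries are present; `dir_size_gb("path/to/bundle/mlx_llm")` can then be compared against the figures in the listing.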

Users who want a smaller bundle can INT4-quantize the audio path themselves. We didn't ship an INT4 audio path for the 8B by default because the 8B's value proposition is quality, not size, and at 1.6 GB the BF16 audio path isn't the binding constraint.
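
For a rough idea of what quantizing the audio path could save, the arithmetic can be sketched as below. It assumes MLX-style affine quantization that stores one 16-bit scale and one 16-bit bias per group, the same group size of 64 used for the LLM, and that every audio-path weight is quantized. The last assumption is optimistic, since norms and other small tensors usually stay in higher precision, so treat the result as a lower bound rather than a measurement.

```python
def effective_bits(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per weight, including the per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(bf16_size_gb, bits=4, group_size=64):
    """Scale a BF16 size (16 bits per weight) down to the quantized size."""
    return bf16_size_gb * effective_bits(bits, group_size) / 16


# 4 + 32/64 = 4.5 bits per weight under the assumptions above.
bits_per_weight = effective_bits()

# Lower-bound estimate for the 1.6 GB BF16 audio path.
audio_int4_gb = estimated_size_gb(1.6)
```

Under these assumptions the audio path would shrink to roughly 0.45 GB, a saving of about 1.15 GB on a 5.9 GB bundle.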

Measured on Apple M3 Ultra

                           This bundle   Full BF16
Decode speed               110 t/s       25 t/s
Decode steady-state peak   5.0 GB        33 GB
Transient prefill peak     8.1 GB        33 GB
Disk footprint             5.9 GB        18.1 GB

Audio quality varies by domain (speech vs non-speech); validate on your target data.

Usage

Same as the 4B bundle; the inference.py script auto-detects size:

pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav

No PyTorch required at runtime (mel path ported to MLX).
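
To process several clips, the single-clip command above can be wrapped in a small driver. This is only a sketch: it uses the documented `--audio` flag and nothing else, and it assumes that inference.py writes its result to stdout, which the source does not confirm. The helper names `build_command` and `run_batch` are illustrative.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path, script="inference.py"):
    """Build the documented single-clip invocation for one audio file."""
    return [sys.executable, script, "--audio", str(audio_path)]


def run_batch(audio_dir, pattern="*.wav"):
    """Run the example script once per matching clip and collect its stdout.

    Assumption: the script prints its answer to stdout. Each call reloads the
    model, so this is convenient rather than fast.
    """
    outputs = {}
    for clip in sorted(Path(audio_dir).glob(pattern)):
        result = subprocess.run(
            build_command(clip), capture_output=True, text=True, check=False
        )
        outputs[clip.name] = result.stdout
    return outputs
```

Calling `run_batch("clips/")` from the bundle directory returns a dict keyed by file name.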

Evaluation (BFCL v3, community-run)

Text-only tool-calling evaluation (not an audio-caption quality metric), run on a 600-sample subset (simple/multiple/parallel = 200 each), greedy decoding (repetition_penalty=0), measured 2026-05.

Category         This bundle   8B BF16   Qwen3-8B base
simple (200)     91.0%         93.5%     94.5%
multiple (200)   92.5%         93.0%     94.5%
parallel (200)   54.5%         78.5%     90.5%
3-cat avg        79.3%         88.3%     93.2%
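
The averages and the INT4-versus-BF16 gaps can be rechecked from the per-category scores. The snippet below only restates the table values; the variable and function names are illustrative.

```python
# Per-category accuracy (%) copied from the table above.
scores = {
    "This bundle": {"simple": 91.0, "multiple": 92.5, "parallel": 54.5},
    "8B BF16": {"simple": 93.5, "multiple": 93.0, "parallel": 78.5},
    "Qwen3-8B base": {"simple": 94.5, "multiple": 94.5, "parallel": 90.5},
}


def category_average(row):
    """Unweighted mean over categories (each category has 200 samples)."""
    return round(sum(row.values()) / len(row), 1)


def gap_pp(reference, candidate):
    """Percentage-point gap per category, reference minus candidate."""
    return {cat: round(reference[cat] - candidate[cat], 1) for cat in reference}


averages = {name: category_average(row) for name, row in scores.items()}
int4_gap = gap_pp(scores["8B BF16"], scores["This bundle"])
```

Because every category has the same sample count, the unweighted mean equals the pooled accuracy over all 600 samples.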

Single-call categories (simple/multiple) are within 0.5 to 2.5 pp of BF16. Parallel degrades by 24 pp under INT4 (78.5% to 54.5%), which is a genuine decode regression. For speech-dominant use cases (dialog, interviews, transcription), tool-calling quality is strong. Parser note: this model can emit bare JSON tool calls; the evaluation accepted schema-valid {name, arguments} JSON as tool-call output.
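
As an illustration of the parser note, a tolerant parser for bare JSON tool calls might look like the sketch below. The evaluation harness itself is not shown in this card, so this is an assumption about its general shape: it accepts a single object or a list of objects (for parallel calls) and checks only that each has a string `name` and an object `arguments`. It does not validate arguments against a function's parameter schema.

```python
import json


def parse_bare_tool_calls(text):
    """Return a list of {name, arguments} dicts, or [] if the text does not qualify."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return []
    # A single call is a dict; parallel calls are assumed to arrive as a list.
    calls = payload if isinstance(payload, list) else [payload]
    valid = []
    for call in calls:
        if not isinstance(call, dict):
            return []
        if not isinstance(call.get("name"), str):
            return []
        if not isinstance(call.get("arguments"), dict):
            return []
        valid.append({"name": call["name"], "arguments": call["arguments"]})
    return valid
```

Rejecting the whole output when any element is malformed keeps scoring strict; a more lenient harness could keep the valid calls instead.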

Limitations

Non-speech instability (upstream-intrinsic). The 8B is unstable on ambient, music, and other non-speech clips: 2/9 hard truncations plus 2 further degenerate outputs (digit-loops, hallucinated JSON "metadata") on our 7-clip × 3-rep EN-scope benchmark. Raising --repetition-penalty to 1.05 makes this worse, not better (10/21 truncations vs 6/21 at RP=1.02).

This failure mode matches the PyTorch reference's behavior (10/21 truncations reported in the v2 benchmark), so it is not an MLX-port regression. We recommend this bundle for speech-dominant use cases (dialog transcription, speaker attribution, spoken-event analysis) and advise against it for music genre tagging or ambient-scene captioning; use the 4B bundle for those.
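
When running the bundle on mixed content, it can help to flag looping outputs automatically so they can be retried or discarded. The heuristic below is a hypothetical sketch, not the method used in the benchmark above: it reports a loop when the end of a sequence repeats with a short period. The thresholds (`max_period=8`, `min_repeats=5`) are arbitrary starting points.

```python
def has_repetition_loop(items, max_period=8, min_repeats=5):
    """Return True if the tail of items repeats with a period of at most max_period.

    items can be whitespace-split words or individual characters. A period is
    accepted only if it repeats at least min_repeats times at the very end.
    """
    items = list(items)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(items) < span:
            break
        tail = items[-span:]
        # The tail is periodic if every element equals the one `period` slots earlier.
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False
```

Pass `output.split()` to catch word-level loops or `list(output)` to catch digit runs that contain no spaces. This does not detect hard truncations or hallucinated JSON, which need separate checks.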

Contributors

  • Rumilabs Inc (building the richest content knowledge base in the world to empower interactive media): quantization, MLX conversion, hybrid runtime design, benchmarking, and release packaging.

License

Apache-2.0 (inherited from base model).

Citation

Base model: MOSS-Audio.
