MOSS-Audio-8B-Thinking-MLX-hybrid
Release Status
- Current community release: Yes
- Variant: MLX-hybrid (INT4 LLM + BF16 audio path), no PyTorch at inference
- Canonical pair: Released alongside MOSS-Audio-4B-Thinking-MLX-4bit
This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-8B-Thinking and released as the MLX-hybrid variant.
- LLM: Qwen3-8B INT4 (group size 64)
- Audio path: encoder + adapter + DeepStack in BF16 on MLX
- Runtime: no PyTorch at inference time
- Target: desktop/workstation (not mobile-viable)
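To make "INT4 (group size 64)" concrete, here is a minimal pure-Python sketch of group-wise 4-bit affine quantization, where every group of 64 weights shares one scale and one offset. It illustrates the general idea only and is not the MLX kernel used for this bundle; the function names are hypothetical, and the 16-bit scale/offset storage is an assumption.

```python
import random

GROUP_SIZE = 64            # from the model card: INT4, group size 64
BITS = 4
LEVELS = (1 << BITS) - 1   # 15, the highest 4-bit code

# Assumption: one 16-bit scale and one 16-bit offset stored per group.
bits_per_weight = BITS + (16 + 16) / GROUP_SIZE


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus a shared scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0   # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(256)]
groups = quantize(weights)
codes, scale, lo = groups[0]
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights[:GROUP_SIZE], restored))
print(f"groups={len(groups)} max_error={max_error:.4f} half_step={scale / 2:.4f}")
```

The reconstruction error per weight is bounded by half a quantization step, which is why smaller groups (tighter min/max ranges) trade extra metadata for lower error.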
Layout
- mlx_llm/: Qwen3-8B INT4 weights + tokenizer (4.3 GB)
- mlx_audio/: Audio encoder + adapter + DeepStack mergers, all BF16 (1.6 GB)
- scripts/: Pure-MLX bridge source (same as 4B)
- inference.py: Standalone example
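After downloading, a quick sanity check of the layout can catch an incomplete copy. The sketch below only checks for the four top-level entries listed above; the helper name `missing_entries` is hypothetical, and it does not validate the files inside each directory.

```python
from pathlib import Path

# Top-level entries named in the Layout section of this card.
EXPECTED_DIRS = ("mlx_llm", "mlx_audio", "scripts")
EXPECTED_FILES = ("inference.py",)


def missing_entries(bundle_root):
    """Return the expected top-level entries that are absent under bundle_root."""
    root = Path(bundle_root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    # Assumes the bundle was downloaded into this directory name.
    problems = missing_entries("MOSS-Audio-8B-Thinking-MLX-hybrid")
    print("layout OK" if not problems else f"missing: {problems}")
```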
Users who want a smaller bundle can INT4-quantize the audio path themselves. We didn't ship INT4 8B audio by default because the 8B's value prop is quality, not size, and at 1.6 GB, BF16 audio isn't the binding constraint.
Measured on Apple M3 Ultra
| Metric | This bundle | Full BF16 |
|---|---|---|
| Decode speed | 110 t/s | 25 t/s |
| Decode steady-state peak memory | 5.0 GB | 33 GB |
| Transient prefill peak memory | 8.1 GB | 33 GB |
| Disk footprint | 5.9 GB | 18.1 GB |
Audio quality varies by domain (speech vs non-speech); validate on your target data.
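The relative gains implied by the table can be recomputed directly. This is plain arithmetic on the published figures; the dictionary keys are illustrative names, not fields of any tool.

```python
# Figures copied from the table above: (this bundle, full BF16).
measured = {
    "decode_tokens_per_s": (110, 25),
    "steady_state_peak_gb": (5.0, 33),
    "prefill_peak_gb": (8.1, 33),
    "disk_gb": (5.9, 18.1),
}


def ratio(metric):
    """How many times larger the bigger figure is than the smaller one."""
    bundle, bf16 = measured[metric]
    return max(bundle, bf16) / min(bundle, bf16)


for metric in measured:
    print(f"{metric}: {ratio(metric):.1f}x")

# Component sizes from the Layout section should add up to the disk footprint.
components_gb = 4.3 + 1.6
print(f"mlx_llm + mlx_audio = {components_gb:.1f} GB")
```

On these figures the bundle decodes about 4.4x faster, needs about 6.6x less steady-state memory and about 4.1x less prefill memory, and is about 3.1x smaller on disk than full BF16.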
Usage
Same as the 4B bundle: the inference.py script auto-detects the model size.
pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav
No PyTorch required at runtime (mel path ported to MLX).
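For more than one clip, the command above can be wrapped in a small batch runner. This is a sketch under two assumptions: that inference.py prints its result to stdout, and that the `--repetition-penalty` flag mentioned under Limitations is accepted by inference.py. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path, script="inference.py", repetition_penalty=None):
    """Build the argv list for one inference.py call."""
    cmd = [sys.executable, script, "--audio", str(audio_path)]
    if repetition_penalty is not None:
        # Assumption: inference.py accepts this flag.
        cmd += ["--repetition-penalty", str(repetition_penalty)]
    return cmd


def run_clip(audio_path, **kwargs):
    """Run one clip and return whatever the script wrote to stdout."""
    result = subprocess.run(
        build_command(audio_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout


def run_folder(folder, **kwargs):
    """Run every .wav file in a folder, in sorted order."""
    return {p.name: run_clip(p, **kwargs) for p in sorted(Path(folder).glob("*.wav"))}


if __name__ == "__main__":
    for name, output in run_folder("clips").items():
        print(f"== {name} ==\n{output}")
```

Each clip runs in a fresh process, so the model is reloaded every time; that keeps the sketch simple at the cost of throughput.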
Evaluation (BFCL v3, community-run)
Text-only tool-calling evaluation (not an audio-caption quality metric), run on a
600-sample subset (simple/multiple/parallel = 200 each), greedy decoding
(repetition_penalty=0), measured 2026-05.
| Category | This bundle | 8B BF16 | Qwen3-8B base |
|---|---|---|---|
| simple (200) | 91.0% | 93.5% | 94.5% |
| multiple (200) | 92.5% | 93.0% | 94.5% |
| parallel (200) | 54.5% | 78.5% | 90.5% |
| 3-cat avg | 79.3% | 88.3% | 93.2% |
Single-call categories (simple/multiple) are within 0.5 to 2.5 pp of BF16. Parallel
degrades by 24 pp under INT4 (78.5% to 54.5%), a genuine decode regression. For
speech-dominant use cases (dialog, interviews, transcription), tool-calling quality is strong.
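The 3-category averages and the INT4-vs-BF16 gaps can be re-derived from the per-category rows. The snippet is plain arithmetic on the table; the column keys are illustrative.

```python
# Per-category accuracy (%) copied from the BFCL v3 table above.
scores = {
    "simple":   {"bundle": 91.0, "bf16": 93.5, "base": 94.5},
    "multiple": {"bundle": 92.5, "bf16": 93.0, "base": 94.5},
    "parallel": {"bundle": 54.5, "bf16": 78.5, "base": 90.5},
}


def average(column):
    """Unweighted mean over the three categories (200 samples each)."""
    return sum(row[column] for row in scores.values()) / len(scores)


# Gap between the BF16 reference and this INT4 bundle, in percentage points.
gaps_pp = {cat: row["bf16"] - row["bundle"] for cat, row in scores.items()}

for column in ("bundle", "bf16", "base"):
    print(f"{column}: {average(column):.1f}%")
print(gaps_pp)
```

Because every category has 200 samples, the unweighted mean equals the pooled accuracy over all 600 samples. The gaps come out at 2.5 pp (simple), 0.5 pp (multiple), and 24.0 pp (parallel).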
Parser note: this model can emit bare JSON tool calls; evaluation accepted
schema-valid {name, arguments} JSON as tool-call output.
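A minimal sketch of a parser for bare JSON tool calls is shown below. It is not the evaluation harness: it checks only the structural `{name, arguments}` shape, not the arguments against a function schema, and accepting a JSON list for parallel calls is an assumption. The function name `parse_tool_calls` is hypothetical.

```python
import json


def _is_tool_call(obj):
    """True if obj has a string "name" and a dict "arguments"."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("arguments"), dict)
    )


def parse_tool_calls(text):
    """Return a list of tool calls found in a bare-JSON model output.

    Accepts one {"name", "arguments"} object or a list of them
    (the list form is an assumption); returns [] for anything else.
    """
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return []
    candidates = payload if isinstance(payload, list) else [payload]
    if not candidates or not all(_is_tool_call(c) for c in candidates):
        return []
    return [{"name": c["name"], "arguments": c["arguments"]} for c in candidates]


single = parse_tool_calls('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(single)
print(parse_tool_calls("The weather in Paris is sunny."))
```

Rejecting the whole output when any element is malformed keeps partial parallel calls from being counted as successes.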
Limitations
Non-speech instability (upstream-intrinsic). 8B is unstable on
ambient / music / non-speech clips: 2/9 hard truncations plus 2 further
degenerate outputs (digit-loops, hallucinated JSON "metadata") on our
7-clip × 3-rep EN-scope benchmark. Raising --repetition-penalty to
1.05 makes this worse, not better (10/21 truncations vs 6/21 at RP=1.02).
This failure mode matches the PyTorch reference's behavior (10/21 truncations reported in the v2 benchmark), so it is not an MLX-port regression. We recommend this bundle for speech-dominant use cases (dialog transcription, speaker attribution, spoken-event analysis). Avoid it for music genre tagging or ambient-scene captioning, and use the 4B bundle for those instead.
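When validating on your own non-speech clips, a simple heuristic can flag loop-style degenerate outputs for manual review. The sketch below is a hypothetical check, not the criterion used in the benchmark above: it splits on whitespace and looks for a short token pattern repeated back-to-back, so loops without spaces (for example a run of digits with no separators) would need a character-level variant.

```python
def has_repetition_loop(text, max_period=8, min_repeats=4):
    """True if some pattern of up to max_period tokens repeats
    back-to-back at least min_repeats times."""
    tokens = text.split()
    for period in range(1, max_period + 1):
        span = period * min_repeats
        for start in range(len(tokens) - span + 1):
            # Every token in the span must equal the token at the same
            # position within the first occurrence of the pattern.
            if all(
                tokens[start + i] == tokens[start + i % period]
                for i in range(span)
            ):
                return True
    return False


print(has_repetition_loop("track id 1 2 3 1 2 3 1 2 3 1 2 3"))
print(has_repetition_loop("two speakers discuss the schedule for next week"))
```

The thresholds are arbitrary starting points; natural speech with deliberate repetition ("no no no no") will also trigger, so treat a hit as a prompt to inspect the output rather than as a verdict.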
Contributors
- Rumilabs Inc ("We are building the richest content knowledge base in the world to empower interactive media"): quantization, MLX conversion, hybrid runtime design, benchmarking, and release packaging.
License
Apache-2.0 (inherited from base model).
Citation
Base model: MOSS-Audio.