MOSS-Audio-8B-Instruct-MLX (hybrid: INT4 LLM + BF16 audio)

An Apple MLX conversion of MOSS-Audio-8B-Instruct, the MOSS-Audio checkpoint with the strongest ASR performance, for fast, low-memory inference on Apple Silicon. The LLM is quantized to uniform INT4 (group_size 64); the audio encoder, adapter, and DeepStack are kept in BF16.

Why this exists. The community had MLX builds of the Thinking variant only. But Thinking is not ASR-optimized: under identical INT4 quantization it misspells letter-spoken tickers (e.g. "CRWD" → "CWD") and its output is unstable. Instruct transcribes them correctly. This build brings Instruct's transcription quality to MLX speed and memory use.

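Summary: this is an Apple MLX conversion of MOSS-Audio-8B-Instruct (uniform INT4 LLM + BF16 audio path). The community previously had MLX builds of the Thinking variant only, but Thinking is not ASR-optimized: under the same INT4 quantization it transcribes letter-spoken tickers such as "CRWD" as "CWD" and is unstable. Instruct transcribes them correctly. This build brings Instruct's transcription quality to MLX speed and memory use.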

Measured (Apple M1 Max 32GB, 28s zh+en clip)

| Metric | PyTorch Instruct | This (Instruct-MLX) | Thinking-MLX |
|---|---|---|---|
| Ticker "CRWD" | C R W D ✅ | C R W D ✅ | CWD ❌ |
| English term (TradingView) | | | ✅(loops) |
| Numerals | Chinese chars | Arabic | 47% Arabic |
| Speed | 1.8x realtime | 6–9x | 5–8x |
| Peak memory | ~17 GB | 7.85 GB | 7.85 GB |
| Disk | 18 GB | 5.9 GB | 5.9 GB |

Key finding. Ticker-ASR degradation in the Thinking-MLX builds comes from the Thinking/Instruct training difference, not from INT4 quantization — under the same uniform INT4, Instruct keeps the ticker. So uniform 4-bit suffices; no mixed-precision needed.
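For readers unfamiliar with what "uniform INT4 (group_size 64)" means, here is a minimal pure-Python sketch of affine group-wise quantization. It assumes each group of 64 weights shares one scale and one bias (the scheme MLX's quantization is documented to use); the function names are hypothetical and this is an illustration, not the MLX kernel or the conversion script itself.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1              # 15 distinct steps for INT4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                      # constant group: every code decodes to lo
        scale = 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo               # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from codes plus per-group scale and bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=4):
    """Split one weight row into groups and quantize each group independently."""
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def bits_per_weight(bits=4, group_size=64, meta_bits=32):
    """Effective storage cost, assuming a 16-bit scale and a 16-bit bias per group."""
    return bits + meta_bits / group_size
```

Under these assumptions `bits_per_weight()` gives 4.5 bits per weight, and the reconstruction error of any weight is bounded by half of its group's scale.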

Usage

pip install mlx mlx-lm soundfile numpy
python inference.py --audio your_clip_16k_mono.wav

Transcription with per-segment timestamps (a Traditional-Chinese prompt triggers zh-Hant output):

python run_moss.py --model . --audio clip.wav \
  --prompt "請逐句轉錄這段音訊,每句標註開始時間。" --temp 0 --repetition-penalty 1.02

  • Audio: 16 kHz mono. The encoder window is Whisper-style, 30 s maximum, so chunk longer audio.
  • Decoding: use greedy decoding (temp=0) for ASR fidelity. temp>0 removes the rare tail digit-loop but degrades content (wrong numerals, out-of-order timestamps).
  • Digit-loop: occasionally the model fails to emit EOS and repeats a digit token at the very tail; truncate the repeated trailing digits in post-processing. Quantization weakens EOS emission; this is a known tail artifact and is harmless for transcription use.
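The post-truncation of the tail digit-loop can be sketched as follows. This is a hedged example, not part of the shipped scripts: the function name and the `min_repeats` threshold are assumptions, and the threshold should be tuned so that legitimate trailing numbers (such as "1000") are left alone.

```python
import re


def truncate_trailing_digit_loop(text, min_repeats=6):
    """Remove a run of one digit repeated at least `min_repeats` times at the tail.

    The run may be separated by whitespace ("3 3 3 ...") or contiguous ("7777...").
    Text without such a run at the very end is returned unchanged.
    """
    # (\d) captures the looping digit; \1 requires the same digit to repeat.
    pattern = re.compile(r"(\d)(?:\s*\1){%d,}\s*$" % (min_repeats - 1))
    match = pattern.search(text)
    if match is None:
        return text
    return text[:match.start()].rstrip()
```

Note that this drops the whole repeated run, so a transcript that genuinely ends in a long string of identical digits would lose it; a higher `min_repeats` makes that less likely.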

How it was converted

Pure metadata-mapped weight conversion (no retraining):

  1. stage1_mapping.py — verify every MLX target key is sourceable from the PyTorch checkpoint; discover the conv layout transform transpose(0,2,3,1) (PyTorch [out,in,h,w] → MLX [out,h,w,in]).
  2. stage2_convert.py — extract language_model.* + lm_head, quantize to INT4 (group_size 64) via mlx; extract audio encoder/adapter/DeepStack, apply the conv transpose, save BF16. Output mirrors the RumiLabs bridge layout exactly.
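The conv layout transform from step 1 can be made explicit with a small dependency-free sketch. On a NumPy or MLX array the same permutation is `w.transpose(0, 2, 3, 1)`; the nested-list version below only spells out the index mapping, and the function name is hypothetical rather than taken from the conversion scripts.

```python
def oihw_to_ohwi(w):
    """Permute a conv weight from PyTorch [out, in, h, w] to MLX [out, h, w, in].

    `w` is a nested list indexed as w[o][i][y][x]; the result is indexed
    as result[o][y][x][i], i.e. axes reordered as (0, 2, 3, 1).
    """
    out_c, in_c = len(w), len(w[0])
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    return [[[[w[o][i][y][x] for i in range(in_c)]
              for x in range(k_w)]
             for y in range(k_h)]
            for o in range(out_c)]
```

Only the axis order changes; no value is altered, which is why the conversion needs no retraining.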

Limitations

  • 30-second audio window (chunk + offset timestamps for longer input).
  • Tail digit-loop under greedy (post-truncate).
  • Homophone errors on domain terms (e.g. 300均 → 三百軍) — fix with a glossary/post-pass.
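A minimal sketch of the "chunk + offset timestamps" workaround for the 30-second window, assuming 16 kHz mono input as stated under Usage. The helper names and the `(start_seconds, text)` segment shape are hypothetical; the model's actual timestamp output format would need to be parsed into that shape first.

```python
SAMPLE_RATE = 16_000  # Hz, as required by the audio encoder


def plan_chunks(total_seconds, window=30.0, overlap=0.0):
    """Return (start, end) times in seconds covering the clip in windows <= `window`."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks, start = [], 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return chunks


def to_sample_range(start, end, sample_rate=SAMPLE_RATE):
    """Convert a (start, end) time span into sample indices for slicing the waveform."""
    return int(start * sample_rate), int(end * sample_rate)


def offset_segments(segments, chunk_start):
    """Shift chunk-relative (start_seconds, text) segments to absolute clip time."""
    return [(t + chunk_start, text) for t, text in segments]
```

Each chunk is transcribed on its own, then its segments are shifted by the chunk's start time before the results are concatenated. A small `overlap` can reduce words cut at chunk boundaries, at the cost of having to deduplicate the overlapping text.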

Credits

License

Apache-2.0 (inherited from base model).
