Mega-ASR: CoreML LUT-4 (end-to-end ASR)

CoreML LUT-4 (4-bit lookup-table palettized) deployment of zhifeixie/Mega-ASR, with an inputs_embeds-aware decoder so that audio embeddings can be scattered into the <|audio_pad|> positions for real ASR, not just text generation.

Converted via ANEMLL with a custom coreml_convert_embeds.py that monkey-patches QwenModel.forward and QwenForCausalLM.forward to accept pre-embedded hidden_states (skipping the internal embed_tokens lookup). The model processes one token per step with a stateful KV cache (28 layers × 2 × 8 KV heads × 512 ctx × 128 head_dim, fp16), LUT-4 weights at --per_channel 8, and fp32 compute precision. fp32 compute is required because compute_precision=FLOAT16 overflows in Qwen3-ASR's RMSNorm/attention layers and produces NaN logits.
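As a quick sanity check on the cache shape quoted above, the arithmetic below estimates the state size. It assumes the factor of 2 stands for keys plus values and that fp16 means 2 bytes per element; the variable names are illustrative only.

```python
# KV cache dimensions as stated in the paragraph above.
layers = 28
keys_and_values = 2      # assumption: the "x 2" factor is K plus V
kv_heads = 8
context_len = 512
head_dim = 128
bytes_per_element = 2    # fp16

elements = layers * keys_and_values * kv_heads * context_len * head_dim
size_bytes = elements * bytes_per_element
size_mib = size_bytes / 2**20

print(f"{elements:,} elements, {size_mib:.1f} MiB")
```

Under these assumptions the state is small compared with the weights, so the 512-token context, not memory, is the practical limit for prompt plus transcript length.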

What's in this repo

| File | Size | Role |
| --- | --- | --- |
| coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/ | 826 MB | Recommended. Qwen3 1.7B LLM, inputs_embeds input, fp32 compute, LUT-4 weights. Pair with the ONNX audio encoder for end-to-end ASR. |
| coreml/mega-asr-llm_lut4.mlpackage/ | 974 MB | Original input_ids variant: standalone Qwen3 1.7B text LLM (no audio scatter). |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| tokenizer/* | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| inference_asr.py | - | End-to-end ASR pipeline: ONNX encoder + CoreML LLM |
| convert_embeds.py | - | The custom converter (use to reproduce / re-quantize) |

Quality (bench)

8-clip Voices-in-the-Wild-Bench agreement (1 − WER), with the prompt forced to English, on an M-series Mac CPU (CPU_AND_NE failed to compile for the ANE due to model size + state):

| Per-sample | Hyp ≈ Ref? | Agreement |
| --- | --- | --- |
| distortion | exact match | 100% |
| dropout | exact match | 100% |
| far_field | exact match | 100% |
| mixed | exact match | 100% |
| noise | exact match | 100% |
| obstructed | "i have forgotten" vs "i forgot" | 88.2% |
| echo (hard, heavy reverb) | "size 25 stand not and the 125 walk" | 47.1% |
| recording (hard, truncated audio) | "train stopped at the station" | 60.0% |
| AVERAGE | | 86.9% |
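For readers who want to reproduce the agreement metric, here is a minimal sketch of 1 − WER using a word-level edit distance. The lowercasing and whitespace tokenization are assumptions; the text normalization actually used for the table is not stated here, so absolute numbers may differ slightly. The last lines simply re-derive the AVERAGE row from the per-sample values in the table.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    prev = list(range(len(h) + 1))          # distances for an empty reference
    for i, ref_word in enumerate(r, 1):
        cur = [i]                           # deleting i reference words
        for j, hyp_word in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)


def agreement(ref: str, hyp: str) -> float:
    return 1.0 - word_error_rate(ref, hyp)


# Per-sample agreement values copied from the table above.
table_scores = [100.0, 100.0, 100.0, 100.0, 100.0, 88.2, 47.1, 60.0]
average = sum(table_scores) / len(table_scores)
print(round(average, 1))
```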

For reference (same 8 samples, same audio encoder, same prompt):

| Backend | Agreement |
| --- | --- |
| ONNX recommended (GPTQ) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| CoreML LUT-4 (this repo) | 86.9% |
| ONNX RTN INT4 baseline | 87.8% |

LUT-4 k-means is a more aggressive quantization than ONNX GPTQ (which uses activation-aware error redistribution) or MLX mixed 8/4 (which keeps the 4 attention projections at 8-bit). The gap of roughly 5 to 6 percentage points vs the leaders is concentrated on the 2 hard samples (echo, recording) and one near-miss on obstructed. Five of eight samples produce exact-match transcriptions.

Inference

pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
    --mlpackage coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage \
    --encoder-path onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir>

The pipeline runs:

  1. Mel features via Qwen3-ASR's WhisperFeatureExtractor.
  2. Audio encoder (ONNX fp32) → audio embeddings (F, 2048).
  3. Prompt + scatter: build the Qwen3-ASR chat template, expand the single <|audio_pad|> placeholder to F slots, look up text embeds via the original HF model's embed_tokens weight, scatter audio embeds in.
  4. CoreML prefill: feed each token's embedding one-at-a-time to populate the KV cache state.
  5. CoreML decode: greedy step-by-step until <|im_end|>.
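Step 3 above can be sketched in NumPy as follows. This is an illustration of the expand-then-scatter idea, not the code in inference_asr.py: the token id for <|audio_pad|>, the toy vocabulary size, and both function names are hypothetical, and random arrays stand in for the real embed_tokens weight and encoder output.

```python
import numpy as np

AUDIO_PAD_ID = 7   # hypothetical id; the real one comes from the tokenizer
HIDDEN = 2048      # embedding width, matching the (F, 2048) encoder output


def expand_audio_pad(token_ids, n_frames, pad_id=AUDIO_PAD_ID):
    """Replace each <|audio_pad|> placeholder with n_frames copies of itself."""
    out = []
    for t in token_ids:
        out.extend([pad_id] * n_frames if t == pad_id else [t])
    return np.asarray(out, dtype=np.int64)


def build_inputs_embeds(token_ids, embed_table, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Look up text embeddings, then overwrite the pad slots with audio embeddings."""
    ids = expand_audio_pad(token_ids, audio_embeds.shape[0], pad_id)
    embeds = embed_table[ids].astype(np.float32)   # fancy indexing returns a copy
    mask = ids == pad_id
    if int(mask.sum()) != audio_embeds.shape[0]:
        raise ValueError("number of pad slots must equal number of audio frames")
    embeds[mask] = audio_embeds                    # scatter, in frame order
    return ids, embeds


# Toy stand-ins: 16-token vocabulary, F = 5 audio frames.
rng = np.random.default_rng(0)
embed_table = rng.standard_normal((16, HIDDEN)).astype(np.float32)
audio = rng.standard_normal((5, HIDDEN)).astype(np.float32)
ids, embeds = build_inputs_embeds([1, 2, AUDIO_PAD_ID, 3], embed_table, audio)
print(ids.tolist(), embeds.shape)
```

The resulting rows of embeds are what gets fed to the LLM one at a time during prefill.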

The KV cache lives inside the CoreML model as state. Call model.make_state() once per request, then pass the same state object to every predict() call.
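The prefill and greedy decode loop can be sketched independently of CoreML. In the sketch below, step is a hypothetical callable that takes one embedding and its position and returns logits; in the real pipeline it would wrap a predict() call that receives the same state object every time. ToyStep is a stand-in so the loop can run anywhere; it is not the CoreML model, and the exact input and output names of the .mlpackage are not assumed here.

```python
import numpy as np


def greedy_asr_decode(step, embed_table, prompt_embeds, eos_id,
                      max_new_tokens=64, max_ctx=512):
    """Prefill one embedding at a time, then decode greedily until eos_id."""
    if len(prompt_embeds) == 0:
        raise ValueError("prompt_embeds must contain at least one row")
    pos, logits = 0, None
    for emb in prompt_embeds:                 # prefill: populates the KV cache
        logits = step(emb, pos)
        pos += 1
    tokens = []
    for _ in range(max_new_tokens):
        tok = int(np.argmax(logits))          # greedy choice
        if tok == eos_id or pos >= max_ctx:   # stop at <|im_end|> or a full cache
            break
        tokens.append(tok)
        logits = step(embed_table[tok], pos)  # feed the new token's embedding
        pos += 1
    return tokens


class ToyStep:
    """Stand-in for the stateful model: emits a fixed script after the prompt."""

    def __init__(self, vocab, prompt_len, script):
        self.vocab, self.prompt_len, self.script = vocab, prompt_len, script
        self.cache = []                       # plays the role of the KV state

    def __call__(self, emb, pos):
        self.cache.append(pos)
        logits = np.zeros(self.vocab, dtype=np.float32)
        k = pos - (self.prompt_len - 1)       # index of the next scripted token
        if 0 <= k < len(self.script):
            logits[self.script[k]] = 1.0
        return logits


embed_table = np.eye(8, dtype=np.float32)
toy = ToyStep(vocab=8, prompt_len=3, script=[3, 1, 4, 0])
out = greedy_asr_decode(toy, embed_table, embed_table[[5, 6, 7]], eos_id=0)
print(out)
```

Keeping the state inside the step callable mirrors the one-state-per-request rule: a new request gets a fresh callable (or a fresh state), never a reused one.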

Conversion details

Two-step monkey-patch in convert_embeds.py lets ANEMLL's Qwen3 conversion accept pre-embedded inputs:

# 1. QwenModel.forward: detect float input_ids and skip embed_tokens
qm.QwenModel.forward = model_forward_or_embeds

# 2. QwenForCausalLM.forward: relax the 2D assert; replicate lm_head logic
qm.QwenForCausalLM.forward = causal_forward_or_embeds

ANEMLL's CoreML conversion then traces with a WrapperEmbeds module whose inputs are (inputs_embeds, position_ids, causal_mask, current_pos, update_mask). coremltools.optimize.coreml.palettize_weights applies LUT-4 with per_grouped_channel / group_size=8.
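To make the LUT-4 idea concrete, here is a small NumPy illustration of k-means palettization: a group of weights is replaced by a 16-entry lookup table plus a 4-bit index per weight. It is a conceptual sketch, not coremltools' implementation, and it assumes that group_size=8 means eight channels share one table. The function name and the quantile initialization are choices made for this example.

```python
import numpy as np


def palettize_lut4(weights, iters=20):
    """1-D k-means with 16 centroids. Returns (lut, indices) for one group."""
    w = weights.astype(np.float32).ravel()
    # Initialize the 16 centroids at evenly spaced quantiles of the weights.
    lut = np.quantile(w, np.linspace(0.0, 1.0, 16)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(w[:, None] - lut[None, :]).argmin(axis=1)  # nearest centroid
        for k in range(16):
            members = w[idx == k]
            if members.size:                                    # skip empty clusters
                lut[k] = members.mean()
    idx = np.abs(w[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8).reshape(weights.shape)


rng = np.random.default_rng(0)
group = rng.standard_normal((8, 256)).astype(np.float32)  # 8 channels, one shared table
lut, idx = palettize_lut4(group)
recon = lut[idx]                                           # de-palettized weights
err = float(np.abs(recon - group).mean())
print(lut.shape, int(idx.max()), err)
```

Each weight is stored as an index below 16, which is where the 4-bit storage comes from; the reconstruction error is what shows up as the quality gap discussed in the benchmark section.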

Key compute-precision tweak: compute_precision=ct.precision.FLOAT32 in ct.convert. fp16 compute produces all-NaN logits on Qwen3-ASR's RMSNorm + attention layers, the same finding as the aoiandroid community CoreML port. Weights stay LUT-4 (4-bit storage); only activations run in fp32.
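The snippet below shows one generic way an fp16 overflow turns into NaN. It is not a trace of the actual failing op in the converted model; the value 300 is an arbitrary activation magnitude chosen so that its square exceeds the fp16 maximum.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)   # largest finite fp16 value

x16 = np.full(4, 300.0, dtype=np.float16)
with np.errstate(over="ignore", invalid="ignore"):
    sq16 = x16 * x16               # 300 * 300 = 90000 exceeds FP16_MAX, becomes inf
    shifted16 = sq16 - sq16.max()  # inf - inf is NaN (e.g. a softmax max-shift)

x32 = x16.astype(np.float32)
sq32 = x32 * x32                   # 90000 is representable in fp32
shifted32 = sq32 - sq32.max()      # finite zeros

print(FP16_MAX, sq16, shifted16, shifted32)
```

Squaring happens inside RMSNorm and large dot products happen inside attention, which is consistent with those layers being where fp16 compute breaks down.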

Also patched: coremltools/converters/mil/frontend/torch/ops.py _cast op handler (numpy array of size 1 → extract scalar via .flatten()[0].item()). Diff lives in convert_embeds.py setup notes.
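The scalar extraction in that patch amounts to the following. This is a sketch of the idea only, with a hypothetical helper name; the actual diff against the _cast handler is in the convert_embeds.py setup notes.

```python
import numpy as np


def to_python_scalar(value):
    """Unwrap a size-1 NumPy array into a plain Python scalar; pass others through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        return value.flatten()[0].item()
    return value


print(to_python_scalar(np.array([[3]])), to_python_scalar(np.array(2.5)))
```

flatten() handles any shape with a single element, including 0-d arrays, which is why it is used before indexing.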

Known limitations

  1. CPU compute only in practice. CoreML's ANE compiler rejects this model (MILCompilerForANE error: failed to compile ANE model using ANEF), likely due to model size + stateful KV cache. CPU_AND_NE / ALL fail to load; CPU_ONLY works and is correct. Per-token latency is ~1.5 s on CPU.
  2. Audio encoder is ONNX. The 24-layer Whisper-style encoder hasn't been ported to CoreML (ANEMLL is LLM-only). End-to-end inference runs the encoder via onnxruntime and the LLM via coremltools.
  3. Quality below ONNX/MLX at 4-bit due to LUT-4 k-means being weaker than GPTQ on this architecture. Mitigations: use LUT-6 (--lut 6 in the converter) to recover ~3% at +50% size, or use the fp16 variant (mega-asr-llm-embeds_fp16.mlpackage, ~3.2 GB) for full quality.

Companion repos

Credits
