Mega-ASR — CoreML LUT-4 (end-to-end ASR)

CoreML LUT-4 (4-bit lookup-table palettized) deployment of zhifeixie/Mega-ASR, with an inputs_embeds-aware decoder so that audio embeddings can be scattered into the <|audio_pad|> positions for real ASR, not just text generation.

Converted via ANEMLL with a custom coreml_convert_embeds.py that monkey-patches QwenModel.forward and QwenForCausalLM.forward to accept pre-embedded hidden_states (skipping the internal embed_tokens lookup). The model runs one token per step with a stateful KV cache (28 layers × 2 × 8 KV heads × 512 ctx × 128 head_dim, fp16), LUT-4 weights at --per_channel 8, and fp32 compute precision. The fp32 setting is needed because compute_precision=FLOAT16 overflows in Qwen3-ASR's RMSNorm/attention layers and produces NaN logits.
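
As a rough check on the cache footprint, the dimensions quoted above can be multiplied out. This is a back-of-the-envelope sketch: the variable names are illustrative, and it assumes 2 bytes per fp16 element with no padding or alignment overhead.

```python
# Dimensions quoted in the paragraph above; the names are illustrative.
num_layers = 28
kv_tensors = 2        # one key tensor and one value tensor per layer
num_kv_heads = 8
context_len = 512
head_dim = 128
bytes_per_fp16 = 2    # assumption: plain fp16 storage, no padding

kv_elements = num_layers * kv_tensors * num_kv_heads * context_len * head_dim
kv_bytes = kv_elements * bytes_per_fp16
kv_mib = kv_bytes / 2**20

print(f"{kv_elements:,} elements, {kv_mib:.1f} MiB")
# prints "29,360,128 elements, 56.0 MiB"
```

Under those assumptions the KV cache state is small next to the weights, at about 56 MiB for the full 512-token context.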

What's in this repo

File	Size	Role
`coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/`	826 MB	Recommended. Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, LUT-4 weights. Pair with the ONNX audio encoder for end-to-end ASR.
`coreml/mega-asr-llm_lut4.mlpackage/`	974 MB	Original `input_ids` variant — standalone Qwen3 1.7B text LLM (no audio scatter).
`onnx/audio_encoder_fp32.onnx`	1.27 GB	24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending)
`tokenizer/*`	—	Original Qwen3-ASR tokenizer (`<|audio_pad|>`, `<asr_text>`, etc.)
`examples/*.wav`	~3 MB	8 noisy benchmark clips from Voices-in-the-Wild-Bench
`inference_asr.py`	—	End-to-end ASR pipeline: ONNX encoder + CoreML LLM
`convert_embeds.py`	—	The custom converter (use to reproduce / re-quantize)

Quality (bench)

Agreement (1 − WER) on 8 clips from Voices-in-the-Wild-Bench, with the prompt language forced to English, run on an M-series Mac CPU (CPU_AND_NE failed to compile for the ANE, likely because of model size and the stateful KV cache):

Sample	Hypothesis vs. reference	Agreement
distortion	exact match	100%
dropout	exact match	100%
far_field	exact match	100%
mixed	exact match	100%
noise	exact match	100%
obstructed	"i have forgotten" vs "i forgot"	88.2%
echo (hard, heavy reverb)	"size 25 stand not and the 125 walk"	47.1%
recording (hard, truncated audio)	"train stopped at the station"	60.0%
AVERAGE		86.9%
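
For readers who want to reproduce the metric, the sketch below computes agreement as 1 − WER from a word-level edit distance, then checks that the AVERAGE row is consistent with an unweighted mean of the per-sample values. The function name and the toy sentence pair are assumptions made for illustration. The benchmark's own scoring script and text normalization are not shown in this repo description and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference and whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical sentence pair: one substitution in four reference words.
toy_agreement = 1.0 - word_error_rate("a b c d", "a x c d")

# Per-sample agreement values copied from the table above.
per_sample = [100.0, 100.0, 100.0, 100.0, 100.0, 88.2, 47.1, 60.0]
average = sum(per_sample) / len(per_sample)

print(f"toy agreement: {toy_agreement:.2f}, average: {average:.1f}%")
# prints "toy agreement: 0.75, average: 86.9%"
```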

For reference (same 8 samples, same audio encoder, same prompt):

Backend	Agreement
ONNX recommended (GPTQ)	92.7%
MLX recommended (mixed 8/4)	92.2%
CoreML LUT-4 (this repo)	86.9%
ONNX RTN INT4 baseline	87.8%

LUT-4 k-means is a more aggressive quantization than ONNX GPTQ (which uses activation-aware error redistribution) or MLX mixed 8/4 (which keeps the 4 attention projections at 8-bit). The roughly 6-point gap to the leaders is concentrated in the 2 hard samples (echo, recording) plus one near-miss on obstructed. Five of the eight samples produce exact-match transcriptions.

Inference

pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
    --mlpackage coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage \
    --encoder-path onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir>

The pipeline runs:

Mel features via Qwen3-ASR's WhisperFeatureExtractor.
Audio encoder (ONNX fp32) → audio embeddings (F, 2048).
Prompt + scatter: build the Qwen3-ASR chat template, expand the single <|audio_pad|> placeholder to F slots, look up the text embeddings in the original HF model's embed_tokens weight, and scatter the audio embeddings into the placeholder positions.
CoreML prefill: feed the token embeddings one at a time to populate the KV cache state.
CoreML decode: greedy step-by-step until <|im_end|>.
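
The prompt + scatter step above can be sketched in NumPy as follows. This is a toy illustration, not the code in inference_asr.py: the placeholder id, the 8-entry vocabulary, the function name, and the random tensors are all assumptions. Only the embedding width of 2048 and the idea of expanding one placeholder to F slots come from the description above.

```python
import numpy as np

AUDIO_PAD_ID = 3   # hypothetical id; the real one comes from the tokenizer
HIDDEN = 2048      # width of the audio encoder output, (F, 2048)

def build_inputs_embeds(token_ids, embed_table, audio_embeds):
    """Expand the single audio placeholder to F slots, then scatter audio embeddings in."""
    num_frames = audio_embeds.shape[0]
    pad_at = token_ids.index(AUDIO_PAD_ID)   # assumes exactly one placeholder
    expanded = token_ids[:pad_at] + [AUDIO_PAD_ID] * num_frames + token_ids[pad_at + 1:]
    ids = np.asarray(expanded)
    embeds = embed_table[ids]                    # text lookup, shape (T, HIDDEN)
    embeds[ids == AUDIO_PAD_ID] = audio_embeds   # overwrite placeholder rows in order
    return embeds

rng = np.random.default_rng(0)
embed_table = rng.standard_normal((8, HIDDEN)).astype(np.float32)    # toy vocabulary of 8
audio_embeds = rng.standard_normal((5, HIDDEN)).astype(np.float32)   # F = 5 frames
prompt_ids = [1, 2, AUDIO_PAD_ID, 4]

inputs_embeds = build_inputs_embeds(prompt_ids, embed_table, audio_embeds)
print(inputs_embeds.shape)   # (8, 2048): 2 text tokens + 5 audio frames + 1 text token
```

Each row of the result is then what the prefill step feeds to the LLM, one position at a time.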

The KV cache lives inside the CoreML model as state. Call model.make_state() once per request, then pass the same state object to every predict() call.
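
The calling pattern described above (one state object, reused for every prefill and decode call) might look like the sketch below. It uses a toy stand-in class so it runs without coremltools or a Mac. ToyStatefulModel, greedy_decode, and the "logits" output key are assumptions, and the real model also expects position_ids, causal_mask, and update_mask inputs, which are omitted here for brevity.

```python
import numpy as np

class ToyStatefulModel:
    """Stand-in that mimics the make_state() / predict(inputs, state) call shape."""

    def __init__(self, vocab_size, eos_id):
        self.vocab_size = vocab_size
        self.eos_id = eos_id

    def make_state(self):
        return {"steps": 0}   # the real state is an opaque KV-cache handle

    def predict(self, inputs, state):
        state["steps"] += 1   # every call advances the shared cache
        logits = np.zeros(self.vocab_size, dtype=np.float32)
        logits[min(state["steps"], self.eos_id)] = 1.0
        return {"logits": logits}   # "logits" is a hypothetical output name


def greedy_decode(model, prompt_embeds, embed_table, eos_id, max_new_tokens=16):
    """Prefill one embedding per call, then decode greedily until eos_id."""
    state = model.make_state()   # once per request
    out, pos = None, 0
    for vec in prompt_embeds:    # prefill (assumes a non-empty prompt)
        out = model.predict({"inputs_embeds": vec[None, None, :],
                             "current_pos": np.array([pos], dtype=np.int32)}, state)
        pos += 1
    tokens = []
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(out["logits"]))
        if next_id == eos_id:
            break
        tokens.append(next_id)
        out = model.predict({"inputs_embeds": embed_table[next_id][None, None, :],
                             "current_pos": np.array([pos], dtype=np.int32)}, state)
        pos += 1
    return tokens, state


embed_table = np.eye(6, dtype=np.float32)   # toy 6-token vocabulary
toy = ToyStatefulModel(vocab_size=6, eos_id=5)
tokens, state = greedy_decode(toy, embed_table[:3], embed_table, eos_id=5)
print(tokens, state["steps"])   # [3, 4] 5
```

With the real package, the stand-in would be replaced by the loaded CoreML model, and a fresh state would be created for each new audio clip so that cached keys and values from one request do not leak into the next.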

Conversion details

A two-step monkey-patch in convert_embeds.py lets ANEMLL's Qwen3 conversion accept pre-embedded inputs:

# 1. QwenModel.forward — detect float input_ids and skip embed_tokens
qm.QwenModel.forward = model_forward_or_embeds

# 2. QwenForCausalLM.forward — relax the 2D assert; replicate lm_head logic
qm.QwenForCausalLM.forward = causal_forward_or_embeds
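
To illustrate the dtype-detection idea behind the first patch, here is a toy version in NumPy. It is not the code from convert_embeds.py: TinyModel, forward_or_embeds, and the "multiply by 2" placeholder for the transformer layers are all invented for this sketch. The only part taken from the description above is the rule "floating-point input means pre-embedded, so skip embed_tokens".

```python
import numpy as np

class TinyModel:
    """Toy stand-in for a decoder whose forward() normally takes integer ids."""

    def __init__(self, vocab_size=10, hidden=4):
        self.embed_tokens = np.arange(vocab_size * hidden, dtype=np.float32).reshape(vocab_size, hidden)

    def forward(self, input_ids):
        hidden_states = self.embed_tokens[input_ids]
        return hidden_states * 2.0   # placeholder for the transformer layers


_original_forward = TinyModel.forward

def forward_or_embeds(self, input_ids):
    """Treat floating-point input as pre-embedded hidden states; otherwise defer."""
    x = np.asarray(input_ids)
    if np.issubdtype(x.dtype, np.floating):
        return x * 2.0               # skip the lookup, run the same placeholder layers
    return _original_forward(self, x)

TinyModel.forward = forward_or_embeds   # the monkey-patch

model = TinyModel()
ids = np.array([1, 3])
from_ids = model.forward(ids)                         # integer path, unchanged behavior
from_embeds = model.forward(model.embed_tokens[ids])  # float path, lookup skipped
print(np.array_equal(from_ids, from_embeds))          # True
```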

ANEMLL's CoreML conversion then traces with a WrapperEmbeds module whose inputs are (inputs_embeds, position_ids, causal_mask, current_pos, update_mask). coremltools.optimize.coreml.palettize_weights applies LUT-4 with per_grouped_channel / group_size=8.
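
To make the grouped LUT-4 idea concrete, the sketch below palettizes a toy weight matrix with one 16-entry lookup table per block of 8 output channels. It is a plain NumPy illustration, not the coremltools implementation: the function names, the quantile initialization, the fixed iteration count, and the assumption that groups run along the output-channel (row) axis are all choices made for this example.

```python
import numpy as np

def kmeans_lut(values, nbits=4, iters=20):
    """1-D k-means: return a 2**nbits-entry lookup table and per-value indices."""
    k = 2 ** nbits
    lut = np.quantile(values, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(values[:, None] - lut[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:              # leave empty clusters where they are
                lut[c] = members.mean()
    idx = np.abs(values[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8)

def palettize_grouped(weight, nbits=4, group_size=8):
    """One LUT per block of `group_size` output channels (rows)."""
    restored = np.empty_like(weight)
    luts = []
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]
        lut, idx = kmeans_lut(block.ravel(), nbits)
        luts.append(lut)
        restored[start:start + group_size] = lut[idx].reshape(block.shape)
    return np.stack(luts), restored

rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 64)).astype(np.float32)   # toy (out, in) matrix
luts, restored = palettize_grouped(weight)
mse = float(np.mean((weight - restored) ** 2))
print(luts.shape, f"mse={mse:.4f}")   # two groups of 8 rows, so luts has shape (2, 16)
```

Smaller groups mean more lookup tables and lower reconstruction error, at the cost of extra table storage. Each stored weight is still a 4-bit index.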

Key compute-precision tweak: compute_precision=ct.precision.FLOAT32 in ct.convert. fp16 compute produces all-NaN logits on Qwen3-ASR's RMSNorm + attention layers — same finding as the aoiandroid community CoreML port. Weights stay LUT-4 (4-bit storage); only activations run fp32.
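
One generic way fp16 overflow turns into NaN can be shown with a toy attention softmax. The values below are invented to force the overflow and are not taken from the model. The 1/sqrt(head_dim) scaling is left out so the effect is easy to see, and the function name is an assumption.

```python
import numpy as np

def attention_probs(q, k, dtype):
    """Softmax over q·k scores with every intermediate held in `dtype`."""
    q = q.astype(dtype)
    k = k.astype(dtype)
    with np.errstate(over="ignore", invalid="ignore"):
        scores = (q @ k.T).astype(dtype)   # fp16 saturates to inf above 65504
        shifted = scores - scores.max()    # inf - inf gives NaN
        weights = np.exp(shifted)
        return weights / weights.sum()

q = np.full((1, 128), 30.0, dtype=np.float32)   # one query, head_dim 128
k = np.full((2, 128), 30.0, dtype=np.float32)   # two identical keys

probs_fp16 = attention_probs(q, k, np.float16)  # score 128 * 900 = 115200 overflows
probs_fp32 = attention_probs(q, k, np.float32)

print(probs_fp16, probs_fp32)   # fp16 row is all NaN, fp32 row is [0.5, 0.5]
```

Once a single NaN appears it propagates through every later layer, which is consistent with the all-NaN logits reported above.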

Also patched: coremltools/converters/mil/frontend/torch/ops.py _cast op handler (numpy array of size 1 → extract scalar via .flatten()[0].item()). Diff lives in convert_embeds.py setup notes.
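
The scalar extraction used in that patch can be tried in isolation. The helper below is a hypothetical wrapper written for this example, not the patched coremltools handler. It only shows that .flatten()[0].item() returns a plain Python scalar for a size-1 array of any rank.

```python
import numpy as np

def scalar_from_size_one(value):
    """Return a Python scalar from a size-1 array of any rank; pass other values through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        return value.flatten()[0].item()
    return value

as_int = scalar_from_size_one(np.array([[7]], dtype=np.int32))
as_float = scalar_from_size_one(np.array([0.5], dtype=np.float32))
print(type(as_int).__name__, as_int, type(as_float).__name__, as_float)
# prints "int 7 float 0.5"
```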

Known limitations

CPU compute only in practice. CoreML's ANE compiler rejects this model (MILCompilerForANE error: failed to compile ANE model using ANEF) — likely due to model size + stateful KV cache. CPU_AND_NE / ALL fail to load; CPU_ONLY works and is correct. Per-token latency is ~1.5 s on CPU.
Audio encoder is ONNX. The 24-layer Whisper-style encoder hasn't been ported to CoreML (ANEMLL is LLM-only). End-to-end inference runs the encoder via onnxruntime and the LLM via coremltools.
Quality below ONNX/MLX at 4-bit due to LUT-4 k-means being weaker than GPTQ on this architecture. Mitigations: use LUT-6 (--lut 6 in the converter) to recover ~3% at +50% size, or use the fp16 variant (mega-asr-llm-embeds_fp16.mlpackage, ~3.2 GB) for full quality.

Companion repos

Reza2kn/mega-asr-onnx — full ONNX pipeline (GPTQ-INT4, 92.7%)
Reza2kn/mega-asr-mlx — MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
Reza2kn/mega-asr-bench — browser demo (WebGPU)

Credits

Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
CoreML conversion via ANEMLL with a custom inputs_embeds patch
Benchmark: Voices-in-the-Wild-Bench
