Mega-ASR: CoreML LUT-4 (end-to-end ASR)
CoreML LUT-4 (4-bit lookup-table palettized) deployment of
zhifeixie/Mega-ASR, with an
input_embeds-aware decoder so audio embeddings can be scattered at
<|audio_pad|> positions to do real ASR, not just text generation.
Converted via ANEMLL with a custom
coreml_convert_embeds.py that monkey-patches QwenModel.forward +
QwenForCausalLM.forward to accept pre-embedded hidden_states (skipping the
internal embed_tokens lookup). The model is single-token-step with a stateful KV
cache (28 layers × 2 × 8 KV heads × 512 ctx × 128 head_dim, fp16), LUT-4
weights at --per_channel 8, and fp32 compute precision, because compute_precision=FLOAT16
overflows in Qwen3-ASR's RMSNorm/attention layers and produces NaN logits.
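The cache shape quoted above implies a fixed memory footprint. A quick back-of-the-envelope check, assuming the factor of 2 stands for keys plus values and that each fp16 element takes 2 bytes:

```python
# KV cache footprint implied by the shape quoted above.
# Assumptions: the "2" is keys + values; fp16 is 2 bytes per element.
layers, kv, kv_heads, ctx, head_dim = 28, 2, 8, 512, 128
bytes_per_elem = 2  # fp16

elements = layers * kv * kv_heads * ctx * head_dim
size_mib = elements * bytes_per_elem / 2**20

print(elements)  # 29360128
print(size_mib)  # 56.0
```

Under those assumptions the state is about 56 MiB, small next to the 826 MB of palettized weights.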
What's in this repo
| File | Size | Role |
|---|---|---|
| coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/ | 826 MB | Recommended. Qwen3 1.7B LLM, inputs_embeds input, fp32 compute, LUT-4 weights. Pair with the ONNX audio encoder for end-to-end ASR. |
| coreml/mega-asr-llm_lut4.mlpackage/ | 974 MB | Original input_ids variant: standalone Qwen3 1.7B text LLM (no audio scatter). |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| tokenizer/* | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| inference_asr.py | - | End-to-end ASR pipeline: ONNX encoder + CoreML LLM |
| convert_embeds.py | - | The custom converter (use to reproduce / re-quantize) |
Quality (bench)
Agreement (1 - WER) on 8 clips from Voices-in-the-Wild-Bench, with the prompt
forced to English, measured on an M-series Mac CPU (CPU_AND_NE failed to
compile for the ANE due to model size + state):
| Sample | Hyp vs. Ref | Agreement |
|---|---|---|
| distortion | exact match | 100% |
| dropout | exact match | 100% |
| far_field | exact match | 100% |
| mixed | exact match | 100% |
| noise | exact match | 100% |
| obstructed | "i have forgotten" vs "i forgot" | 88.2% |
| echo (hard, heavy reverb) | "size 25 stand not and the 125 walk" | 47.1% |
| recording (hard, truncated audio) | "train stopped at the station" | 60.0% |
| AVERAGE | | 86.9% |
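For readers who want to recompute the metric, here is a minimal sketch of agreement as 1 - WER, using a word-level edit distance. The benchmark's own text normalization is not described here, so this assumes plain whitespace tokenization, and the example sentences are made up rather than taken from the benchmark references.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)  # assumes a non-empty reference


def agreement(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)


# Made-up sentences, not benchmark references.
ref = "the train stopped at the station"
print(agreement(ref, ref))                                  # 1.0
print(round(agreement(ref, "train stopped at station"), 3))  # 0.667

# Mean of the per-sample agreements listed in the table.
per_sample = [100, 100, 100, 100, 100, 88.2, 47.1, 60.0]
print(round(sum(per_sample) / len(per_sample), 1))           # 86.9
```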
For reference (same 8 samples, same audio encoder, same prompt):
| Backend | Agreement |
|---|---|
| ONNX recommended (GPTQ) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| CoreML LUT-4 (this repo) | 86.9% |
| ONNX RTN INT4 baseline | 87.8% |
LUT-4 k-means is a more aggressive quantization than ONNX GPTQ (which uses
activation-aware error redistribution) or MLX mixed 8/4 (which keeps the
4 attention projections at 8-bit). The gap of roughly 6 percentage points to
the leaders is concentrated on the 2 hard samples (echo, recording) and one
near-miss on obstructed. Five of eight samples produce exact-match transcriptions.
Inference
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
--mlpackage coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage \
--encoder-path onnx/audio_encoder_fp32.onnx \
--examples-dir examples \
--qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir>
The pipeline runs:
- Mel features via Qwen3-ASR's WhisperFeatureExtractor.
- Audio encoder (ONNX fp32), which produces audio embeddings of shape (F, 2048).
- Prompt + scatter: build the Qwen3-ASR chat template, expand the single <|audio_pad|> placeholder to F slots, look up text embeds via the original HF model's embed_tokens weight, scatter audio embeds in.
- CoreML prefill: feed each token's embedding one at a time to populate the KV cache state.
- CoreML decode: greedy step-by-step until <|im_end|>.
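The prompt + scatter step can be sketched as follows. This is a toy illustration with plain Python lists, not the code in inference_asr.py: the token ids, the `embed_table` dict, and both function names are hypothetical, and the hidden size is 2 instead of 2048.

```python
def expand_audio_pad(token_ids, audio_pad_id, num_frames):
    """Replace the single audio-pad token with num_frames copies of it."""
    if token_ids.count(audio_pad_id) != 1:
        raise ValueError("expected exactly one audio-pad placeholder")
    i = token_ids.index(audio_pad_id)
    return token_ids[:i] + [audio_pad_id] * num_frames + token_ids[i + 1:]


def scatter_audio(token_ids, audio_pad_id, embed_table, audio_embeds):
    """One text embedding per token, with audio rows written into the pad slots."""
    expanded = expand_audio_pad(token_ids, audio_pad_id, len(audio_embeds))
    frames = iter(audio_embeds)
    return [next(frames) if t == audio_pad_id else embed_table[t] for t in expanded]


# Toy data: hypothetical token ids, hidden size 2.
AUDIO_PAD = 9
embed_table = {1: [0.1, 0.1], 2: [0.2, 0.2], AUDIO_PAD: [0.0, 0.0]}
audio_embeds = [[7.0, 7.0], [8.0, 8.0], [9.0, 9.0]]  # F = 3 frames

rows = scatter_audio([1, AUDIO_PAD, 2], AUDIO_PAD, embed_table, audio_embeds)
print(len(rows))  # 5
print(rows)       # [[0.1, 0.1], [7.0, 7.0], [8.0, 8.0], [9.0, 9.0], [0.2, 0.2]]
```

The resulting sequence of rows is what the prefill step feeds to the LLM, one row per call.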
The KV cache lives inside the CoreML model as state. Call model.make_state()
once per request, then pass the same state object to every predict() call.
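A minimal sketch of that calling pattern, covering prefill and greedy decode. `make_state()` and `predict(..., state=...)` are the coremltools calls named above; everything else is an assumption for illustration: the `"logits"` output key, the `make_inputs` callback that would build the full input dict, and the `FakeModel` stand-in that lets the loop run without Core ML.

```python
def transcribe_tokens(model, prompt_embeds, embed_token, make_inputs,
                      eos_id, max_new_tokens=64, logits_key="logits"):
    """Prefill one embedding at a time, then decode greedily until eos_id."""
    state = model.make_state()  # fresh KV cache for this request
    out = None
    for pos, emb in enumerate(prompt_embeds):  # prefill (prompt must be non-empty)
        out = model.predict(make_inputs(emb, pos), state=state)
    tokens, pos = [], len(prompt_embeds)
    while len(tokens) < max_new_tokens:  # decode
        logits = list(out[logits_key])
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
        out = model.predict(make_inputs(embed_token(next_id), pos), state=state)
        pos += 1
    return tokens


class FakeModel:
    """Stand-in exposing the same two methods; not a real Core ML model."""

    def make_state(self):
        return {"steps": 0}

    def predict(self, inputs, state):
        state["steps"] += 1
        winner = state["steps"] % 5  # pretend vocabulary of 5 tokens
        return {"logits": [1.0 if i == winner else 0.0 for i in range(5)]}


tokens = transcribe_tokens(
    FakeModel(),
    prompt_embeds=[[0.0], [0.0]],
    embed_token=lambda tok: [float(tok)],
    make_inputs=lambda emb, pos: {"inputs_embeds": emb, "current_pos": pos},
    eos_id=4,
)
print(tokens)  # [2, 3]
```

With the real model, `make_inputs` would also have to supply position_ids, causal_mask, and update_mask, and prompt length plus generated tokens must stay within the 512-token context.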
Conversion details
A two-step monkey-patch in convert_embeds.py lets ANEMLL's Qwen3 conversion
accept pre-embedded inputs:
# 1. QwenModel.forward: detect float input_ids and skip embed_tokens
qm.QwenModel.forward = model_forward_or_embeds
# 2. QwenForCausalLM.forward: relax the 2D assert; replicate lm_head logic
qm.QwenForCausalLM.forward = causal_forward_or_embeds
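The idea behind the first patch can be shown on a toy class. This is not ANEMLL's code or the contents of convert_embeds.py: `TinyModel`, its `layers` method, and `forward_or_embeds` are hypothetical, and the real patch checks the tensor dtype rather than Python list types.

```python
class TinyModel:
    """Toy stand-in for a decoder whose forward() expects integer token ids."""

    def __init__(self):
        self.embed_table = {0: [0.0, 0.0], 1: [1.0, 1.0]}

    def embed_tokens(self, input_ids):
        return [self.embed_table[i] for i in input_ids]

    def layers(self, hidden):
        return [[2 * v for v in row] for row in hidden]  # placeholder for the stack

    def forward(self, input_ids):
        return self.layers(self.embed_tokens(input_ids))


_original_forward = TinyModel.forward


def forward_or_embeds(self, input_ids):
    """Treat rows of floats as pre-embedded hidden states; otherwise fall back."""
    if input_ids and isinstance(input_ids[0], list):
        return self.layers(input_ids)  # skip the embed_tokens lookup
    return _original_forward(self, input_ids)


TinyModel.forward = forward_or_embeds  # the monkey-patch

m = TinyModel()
print(m.forward([1, 0]))        # [[2.0, 2.0], [0.0, 0.0]]
print(m.forward([[0.5, 0.5]]))  # [[1.0, 1.0]]
```

Patching the class attribute rather than an instance means every model constructed afterwards picks up the new behaviour, which is what a converter that builds the model internally needs.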
ANEMLL's CoreML conversion then traces with a WrapperEmbeds module whose
inputs are (inputs_embeds, position_ids, causal_mask, current_pos, update_mask).
coremltools.optimize.coreml.palettize_weights applies LUT-4 with
per_grouped_channel / group_size=8.
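To make the LUT-4 / group_size=8 setting concrete, here is a pure-Python sketch of k-means palettization. It is an illustration of the technique, not coremltools' implementation: it assumes groups are formed along the output-channel (row) axis, uses plain Lloyd iterations with evenly spaced initial centroids, and all function names are hypothetical.

```python
import math


def nearest(lut, v):
    """Index of the LUT entry closest to v."""
    return min(range(len(lut)), key=lambda i: abs(lut[i] - v))


def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on scalars; returns k centroids (the LUT)."""
    lo, hi = min(values), max(values)
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # even initial spread
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(lut, v)].append(v)
        # Empty buckets keep their previous centroid.
        lut = [sum(b) / len(b) if b else c for b, c in zip(buckets, lut)]
    return lut


def palettize(weight, nbits=4, group_size=8):
    """Each block of group_size rows shares one LUT of 2**nbits entries."""
    groups = []
    for start in range(0, len(weight), group_size):
        rows = weight[start:start + group_size]
        lut = kmeans_1d([v for row in rows for v in row], 2 ** nbits)
        indices = [[nearest(lut, v) for v in row] for row in rows]
        groups.append((lut, indices))
    return groups


def dequantize(groups):
    return [[lut[i] for i in row] for lut, indices in groups for row in indices]


# Toy weight matrix: 16 output channels x 4 inputs, deterministic values.
weight = [[math.sin(r * 4 + c) for c in range(4)] for r in range(16)]
groups = palettize(weight)
restored = dequantize(groups)
max_err = max(abs(a - b) for wr, rr in zip(weight, restored) for a, b in zip(wr, rr))
print(len(groups), len(groups[0][0]))  # 2 16
print(max_err)
```

Each weight is stored as a 4-bit index into its group's 16-entry table, so smaller groups mean more tables and a closer fit at the cost of extra storage.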
Key compute-precision tweak: compute_precision=ct.precision.FLOAT32
in ct.convert. fp16 compute produces all-NaN logits on Qwen3-ASR's
RMSNorm + attention layers, the same finding as the aoiandroid community
CoreML port. Weights stay LUT-4 (4-bit storage); only activations run fp32.
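The snippet below illustrates one generic way fp16 arithmetic can break a normalization layer. It is not a trace of where the real model overflows: the activation values are made up, `fp16` emulates half-precision rounding with the stdlib `struct` module, and `rms_norm` omits the learned scale.

```python
import math
import struct


def fp16(x):
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def rms_norm(xs, cast=float, eps=1e-6):
    """RMSNorm without the learned scale; `cast` is applied after every op."""
    squares = [cast(x * x) for x in xs]
    mean_sq = cast(sum(squares) / len(xs))
    inv_rms = cast(1.0 / math.sqrt(mean_sq + eps))
    return [cast(x * inv_rms) for x in xs]


acts = [300.0, -400.0]  # made-up activations
print(rms_norm(acts))              # finite, roughly [0.849, -1.131]
print(rms_norm(acts, cast=fp16))   # [0.0, -0.0]: 300**2 already exceeds 65504
print(math.isnan(math.inf * 0.0))  # True
```

Once the squared sum saturates to infinity the normalized output collapses to zero, and any infinite activation multiplied by that zero becomes NaN, which then propagates to the logits.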
Also patched: the _cast op handler in
coremltools/converters/mil/frontend/torch/ops.py (a numpy array of size 1 is
now reduced to a scalar via .flatten()[0].item()). The diff lives in the
setup notes in convert_embeds.py.
Known limitations
- CPU compute only in practice. CoreML's ANE compiler rejects this model (MILCompilerForANE error: failed to compile ANE model using ANEF), likely due to model size + stateful KV cache. CPU_AND_NE / ALL fail to load; CPU_ONLY works and is correct. Per-token latency is ~1.5 s on CPU.
- Audio encoder is ONNX. The 24-layer Whisper-style encoder hasn't been ported to CoreML (ANEMLL is LLM-only). End-to-end inference runs the encoder via onnxruntime and the LLM via coremltools.
- Quality below ONNX/MLX at 4-bit due to LUT-4 k-means being weaker than GPTQ on this architecture. Mitigations: use LUT-6 (--lut 6 in the converter) to recover ~3% at +50% size, or use the fp16 variant (mega-asr-llm-embeds_fp16.mlpackage, ~3.2 GB) for full quality.
Companion repos
- Reza2kn/mega-asr-onnx: full ONNX pipeline (GPTQ-INT4, 92.7%)
- Reza2kn/mega-asr-mlx: MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- Reza2kn/mega-asr-bench: browser demo (WebGPU)
Credits
- Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
- CoreML conversion via ANEMLL with a custom input_embeds patch
- Benchmark: Voices-in-the-Wild-Bench