Qwen2.5-Omni-3B Audio Understanding - Core AI
Qwen2.5-Omni-3B's Thinker converted to Apple
Core AI (.aimodel / .aimodelc, iOS 27 / macOS 27) for on-device audio understanding.
The model describes the sounds it hears (events, texture, emotion, music); it is not a
transcriber. Example outputs: "I hear a loud hissing sound." · "…a continuous sine wave sound." · "…a series of
beeps."
Part of the CoreAI-Model-Zoo. Device-verified on iPhone 17 Pro (A19 Pro) and M4 Max.
What's here
Two models, run as a pair on the coreai-pipelined GPU engine:
| path | what | size |
|---|---|---|
| gpu-pipelined/qwen2_5_omni_3b_thinker_int8lin_n750_s1/ | text decoder (Qwen2.5-3B int8lin, S=1), macOS | 3.9 GB |
| gpu-pipelined/qwen2_5_omni_3b_audio_encoder_fp16_k15.aimodel/ | Whisper-style audio encoder (fp16, K=15, 30 s), both platforms | 1.2 GB |
| ios/qwen2_5_omni_3b_thinker_n750_ios/ | text decoder AOT (.aimodelc, iPhone 17 Pro / h18p) | 4.5 GB |
The decoder's audio embeddings ride in one static-input buffer (audio_embeds [750,2048]); the prompt's
<|AUDIO|> placeholders carry extension ids (vocab + slot) that the graph gathers. TMRoPE collapses to
1-D for audio+text, so positions are engine-native (no rope-shift inputs). iPhone uses the AOT
decoder so that the 3.9 GB graph avoids a jetsam kill during on-device JIT compilation; the AOT weights mmap as clean pages, so
the model loads comfortably (≈5.9 GB headroom after load on a 12 GB device, with the
increased-memory-limit entitlement).
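The placeholder scheme can be illustrated with a minimal sketch. It assumes that each <|AUDIO|> placeholder is rewritten to the id `vocabSize + slot`, and that the graph reads row `slot` of audio_embeds for any id at or above the vocabulary size. The function names, parameters, and toy sizes are hypothetical and are not the engine's API.

```swift
// Hypothetical sketch: names and the exact id layout are assumptions for illustration.

/// Rewrites each audio placeholder token to an extension id `vocabSize + slot`,
/// where `slot` counts placeholders in prompt order.
/// Returns nil if the prompt has more placeholders than buffer rows.
func assignExtensionIds(tokens: [Int], audioPlaceholder: Int,
                        vocabSize: Int, maxSlots: Int) -> [Int]? {
    var slot = 0
    var out: [Int] = []
    out.reserveCapacity(tokens.count)
    for token in tokens {
        if token == audioPlaceholder {
            guard slot < maxSlots else { return nil }
            out.append(vocabSize + slot)
            slot += 1
        } else {
            out.append(token)
        }
    }
    return out
}

/// Gather step: ids below the vocabulary size read the text embedding table,
/// ids at or above it read row (id - vocabSize) of the audio buffer.
func gatherEmbedding(id: Int, textTable: [[Float]], audioEmbeds: [[Float]]) -> [Float] {
    let vocabSize = textTable.count
    return id < vocabSize ? textTable[id] : audioEmbeds[id - vocabSize]
}

// Toy example: vocabulary of 4 tokens, token 3 plays the role of <|AUDIO|>.
let toyIds = assignExtensionIds(tokens: [0, 3, 3, 1], audioPlaceholder: 3,
                                vocabSize: 4, maxSlots: 750)
let toyText: [[Float]] = [[0], [1], [2], [3]]
let toyAudio: [[Float]] = [[10], [11]]
let toyRow = gatherEmbedding(id: 5, textTable: toyText, audioEmbeds: toyAudio)
```

In this toy run the two placeholders become ids 4 and 5, and id 5 resolves to the second audio row. Because the ids are plain integers, the prompt needs no extra inputs beyond the one static buffer.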
Use it
The coreai-audio app
(record from the mic / choose a file / demo clip → "what do you hear?"), or
CoreAIKit:
let model = try await KitAudioModel(model: .qwen2_5Omni3B) // downloads decoder + encoder
try await model.attach(samples: pcm16kMono) // mel → encoder → static buffer
let answer = try await LanguageModelSession(model: model).respond(to: "What do you hear?")
The 16 kHz log-mel front end follows Whisper-large-v3 (implemented with Accelerate/vDSP) and is bit-exact with the HF feature extractor (gated at cos 1.0). Any clip is decoded to 16 kHz mono, ≤ ~30 s.
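The sizes involved can be checked with a little arithmetic. The sketch below assumes the standard Whisper hop of 160 samples, an overall encoder downsampling factor of 4 (an assumption, chosen because it is consistent with the 750-row audio_embeds buffer), and zero-padding of short clips. `fitClip` is a hypothetical helper, not part of CoreAIKit.

```swift
// Hypothetical sketch: fitClip and the downsampling factor are assumptions.
let sampleRate = 16_000        // Hz, mono
let maxSeconds = 30            // clip budget
let hopLength = 160            // Whisper feature-extractor hop (10 ms at 16 kHz)
let encoderDownsample = 4      // assumed: mel frames per encoder output row

let maxSamples = sampleRate * maxSeconds        // samples in a full clip
let melFrames = maxSamples / hopLength          // log-mel frames for a full clip
let encoderRows = melFrames / encoderDownsample // rows handed to the decoder

/// Truncates a clip to `maxSamples`, or zero-pads it up to that length,
/// so a static-shape encoder always sees the same input size.
func fitClip(_ samples: [Float], maxSamples: Int) -> [Float] {
    if samples.count >= maxSamples {
        return Array(samples.prefix(maxSamples))
    }
    return samples + [Float](repeating: 0, count: maxSamples - samples.count)
}

let shortClip = fitClip([0.5, -0.5], maxSamples: maxSamples)
```

Under these assumptions a full clip is 480,000 samples, which yields 3,000 mel frames and 750 encoder rows.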
Conversion / numerics
Conversion code + gates:
conversion (export_qwen2_5_omni_thinker.py /
export_qwen2_5_omni_audio.py). The decoder int8lin gate is top-1-exact against the fp32 HF oracle; the
encoder static rework scores cos 1.0 against eager (0.99999 on GPU); the Swift vDSP mel scores cos 1.0 against the HF
extractor. iPhone greedy decoding matches the Mac content (white noise → "I hear a loud hissing sound.").
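The two kinds of gate mentioned here, cosine similarity and top-1 agreement, can be sketched as follows. This is a generic illustration of the metrics, not the repository's gate code; the function names are hypothetical.

```swift
// Hypothetical sketch of the two gate metrics; not taken from the conversion scripts.

/// Cosine similarity of two equal-length vectors, accumulated in Double.
/// Returns NaN if either vector has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match and be non-empty")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for i in a.indices {
        let x = Double(a[i]), y = Double(b[i])
        dot += x * y
        normA += x * x
        normB += y * y
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Index of the largest logit (first one wins on ties).
func argmax(_ logits: [Float]) -> Int {
    precondition(!logits.isEmpty, "logits must be non-empty")
    var best = 0
    for i in logits.indices where logits[i] > logits[best] { best = i }
    return best
}

/// True when both models pick the same top-1 token at every step.
func top1Exact(_ candidate: [[Float]], _ reference: [[Float]]) -> Bool {
    guard candidate.count == reference.count else { return false }
    return zip(candidate, reference).allSatisfy { pair in
        argmax(pair.0) == argmax(pair.1)
    }
}
```

A top-1 gate tolerates small logit drift from quantization as long as the chosen token never changes, which is why it suits an int8 decoder, while cosine similarity is the natural check for continuous outputs such as encoder features or mel spectrograms.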
License
Apache-2.0 (inherits Qwen2.5-Omni-3B). A community conversion, not affiliated with Alibaba or Apple.
Model tree for mlboydaisuke/Qwen2.5-Omni-3B-Audio-CoreAI
Base model
Qwen/Qwen2.5-Omni-3B