Qwen2.5-Omni-3B Audio Understanding: Core AI

Qwen2.5-Omni-3B's Thinker converted to Apple Core AI (.aimodel / .aimodelc, iOS 27 / macOS 27) for on-device audio understanding. The model describes the sounds it hears (events, texture, emotion, music); it is not a transcriber. Example outputs: "I hear a loud hissing sound." · "…a continuous sine wave sound." · "…a series of beeps."

Part of the CoreAI-Model-Zoo. Device-verified on iPhone 17 Pro (A19 Pro) and M4 Max.

What's here

Two models, run as a pair on the coreai-pipelined GPU engine:

| path | what | size |
| --- | --- | --- |
| gpu-pipelined/qwen2_5_omni_3b_thinker_int8lin_n750_s1/ | text decoder (Qwen2.5-3B int8lin, S=1), macOS | 3.9 GB |
| gpu-pipelined/qwen2_5_omni_3b_audio_encoder_fp16_k15.aimodel/ | Whisper-style audio encoder (fp16, K=15 ≈ 30 s), both platforms | 1.2 GB |
| ios/qwen2_5_omni_3b_thinker_n750_ios/ | text decoder AOT (.aimodelc, iPhone 17 Pro / h18p) | 4.5 GB |

The decoder's audio embeds ride one static-input buffer (audio_embeds [750,2048]); the prompt's <|AUDIO|> placeholders carry extension ids (vocab + slot) that the graph gathers from that buffer. TMRoPE collapses to 1-D for audio+text, so positions are engine-native (no rope-shift inputs). iPhone uses the AOT decoder so the 3.9 GB graph avoids a jetsam kill during on-device JIT compilation; the AOT weights mmap as clean pages, so it loads comfortably (≈5.9 GB headroom after load on a 12 GB device, with the increased-memory-limit entitlement).
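The placeholder scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the converted graph itself: the sizes are toy values (the real buffer is audio_embeds [750,2048]), and `extensionID(forSlot:)`, `embed(_:)`, `textEmbeds`, and `audioEmbeds` are hypothetical names chosen for the example.

```swift
// Toy sizes for illustration only; the real graph uses audio_embeds [750, 2048].
let vocabSize = 8          // hypothetical text vocabulary size
let hiddenSize = 4         // hypothetical embedding width
let audioSlots = 3         // hypothetical number of audio rows

// Hypothetical text embedding table: row i is filled with Float(i).
let textEmbeds: [[Float]] = (0..<vocabSize).map { i in
    [Float](repeating: Float(i), count: hiddenSize)
}
// Hypothetical static audio buffer: row s is filled with 100 + s.
let audioEmbeds: [[Float]] = (0..<audioSlots).map { s in
    [Float](repeating: Float(100 + s), count: hiddenSize)
}

/// Extension id for an audio slot: the text vocabulary size plus the slot index.
func extensionID(forSlot slot: Int) -> Int {
    return vocabSize + slot
}

/// Gather one embedding row. Ids below vocabSize read the text table;
/// ids at or above it read the audio buffer at (id - vocabSize).
func embed(_ id: Int) -> [Float] {
    if id < vocabSize {
        return textEmbeds[id]
    }
    let slot = id - vocabSize
    precondition(slot < audioSlots, "extension id points past the audio buffer")
    return audioEmbeds[slot]
}

// Prompt: two text tokens, one placeholder per audio slot, one text token.
let promptIDs: [Int] = [1, 2] + (0..<audioSlots).map(extensionID(forSlot:)) + [5]
let rows: [[Float]] = promptIDs.map(embed)
```

Because the audio rows live in a fixed-shape buffer and are addressed by id, the decoder graph in this sketch keeps a single input layout regardless of what the clip contains.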

Use it

Use the coreai-audio app (record from the mic / choose a file / demo clip → "what do you hear?"), or call CoreAIKit directly:

let model = try await KitAudioModel(model: .qwen2_5Omni3B)   // downloads decoder + encoder
try await model.attach(samples: pcm16kMono)                  // mel → encoder → static buffer
let answer = try await LanguageModelSession(model: model).respond(to: "What do you hear?")

The 16 kHz log-mel front end is Whisper-large-v3 (Accelerate/vDSP) and matches the HF feature extractor (gated at cos 1.0). Any clip is decoded to 16 kHz mono, at most ~30 s.
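For callers that bring their own PCM, the clip constraints above can be sketched as below. This is only an assumption-laden illustration: `prepareClip(interleaved:channels:)` is a hypothetical helper, the Float sample type is a guess (the card does not state what `attach(samples:)` expects), the input is assumed to already be at 16 kHz, and whether the kit truncates or rejects longer clips is not stated here.

```swift
let sampleRate = 16_000                    // from the card: 16 kHz front end
let maxSeconds = 30                        // from the card: about 30 s per clip
let maxSamples = sampleRate * maxSeconds   // 480_000 samples
let audioRows = 750                        // from the card: audio_embeds [750, 2048]

// Derived from the figures above: 750 rows over 30 s is 25 rows per second (40 ms each).
let rowsPerSecond = audioRows / maxSeconds

/// Downmix interleaved Int16 PCM to mono Float and cap the length at maxSamples.
/// Assumes the input is already sampled at 16 kHz; resampling is out of scope.
func prepareClip(interleaved: [Int16], channels: Int) -> [Float] {
    precondition(channels > 0, "need at least one channel")
    let frameCount = min(interleaved.count / channels, maxSamples)
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for ch in 0..<channels {
            // Scale Int16 to roughly [-1, 1) before averaging the channels.
            sum += Float(interleaved[frame * channels + ch]) / 32768.0
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```

The result could then be passed where the snippet above uses `pcm16kMono`, if the sample type matches.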

Conversion / numerics

Conversion code + gates: conversion (export_qwen2_5_omni_thinker.py / export_qwen2_5_omni_audio.py). Decoder int8lin gates top-1-exact vs the fp32 HF oracle; the encoder static rework is cos 1.0 vs eager (GPU 0.99999); the Swift vDSP mel is cos 1.0 vs the HF extractor. Greedy decoding on iPhone matches the Mac output (white-noise → "I hear a loud hissing sound.").
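The two kinds of gate mentioned above (cosine similarity and top-1 agreement) can be sketched in a few lines. This is a generic illustration, not the repository's gate code: the function names are hypothetical, and the actual scripts may compute or threshold these quantities differently.

```swift
/// Cosine similarity of two equal-length vectors, accumulated in Double.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match and be non-empty")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for i in 0..<a.count {
        let x = Double(a[i]), y = Double(b[i])
        dot += x * y
        normA += x * x
        normB += y * y
    }
    let denom = (normA * normB).squareRoot()
    // A zero vector has no direction; report 0 rather than dividing by zero.
    return denom == 0 ? 0 : dot / denom
}

/// Index of the largest logit (the first one wins on ties).
func argmax(_ logits: [Float]) -> Int {
    precondition(!logits.isEmpty, "logits must be non-empty")
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] {
        best = i
    }
    return best
}

/// Top-1 gate: the candidate and the reference pick the same token.
func top1Matches(_ candidate: [Float], _ reference: [Float]) -> Bool {
    return argmax(candidate) == argmax(reference)
}
```

A top-1 check is the natural fit for a quantized decoder, where logits shift slightly but the chosen token should not, while cosine similarity suits continuous outputs such as encoder features or mel frames.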

License

Apache-2.0 (inherits Qwen2.5-Omni-3B). A community conversion, not affiliated with Alibaba or Apple.
