Qwen2.5-Omni-3B Audio Understanding — Core AI

Qwen2.5-Omni-3B's Thinker, converted to Apple Core AI (.aimodel / .aimodelc, iOS 27 / macOS 27) for on-device audio understanding. The model describes the sounds it hears (events, texture, emotion, music); it is not a transcriber. Example outputs: "I hear a loud hissing sound." · "…a continuous sine wave sound." · "…a series of beeps."

Part of the CoreAI-Model-Zoo. Device-verified on iPhone 17 Pro (A19 Pro) and M4 Max.

What's here

Two models (a text decoder and an audio encoder), run as a pair on the coreai-pipelined GPU engine. The decoder ships in a macOS build and an iOS AOT build:

| path | what | size |
| --- | --- | --- |
| `gpu-pipelined/qwen2_5_omni_3b_thinker_int8lin_n750_s1/` | text decoder (Qwen2.5-3B int8lin, S=1), macOS | 3.9 GB |
| `gpu-pipelined/qwen2_5_omni_3b_audio_encoder_fp16_k15.aimodel/` | Whisper-style audio encoder (fp16, K=15 ≈ 30 s), both platforms | 1.2 GB |
| `ios/qwen2_5_omni_3b_thinker_n750_ios/` | text decoder AOT (`.aimodelc`, iPhone 17 Pro / h18p) | 4.5 GB |
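
The names in the table imply a fixed audio budget: 750 embedding slots for roughly 30 s of audio, or about 25 slots per second. The sketch below only spells out that arithmetic. It assumes that `n750` is the slot count and that `K=15` is the number of encoder chunks; both readings are inferred from the names, not confirmed by the source. `AudioBudget` is a hypothetical helper and not part of CoreAIKit.

```swift
import Foundation

/// Hypothetical helper illustrating the fixed audio budget implied by the table.
/// All constants are read off the names above (n750, K=15 ≈ 30 s) and are assumptions.
struct AudioBudget {
    let slots = 750          // rows in the static audio buffer (assumed from "n750")
    let chunks = 15          // encoder chunks (assumed meaning of "K=15")
    let maxSeconds = 30.0    // approximate clip length the encoder covers

    var slotsPerSecond: Double { Double(slots) / maxSeconds }
    var secondsPerChunk: Double { maxSeconds / Double(chunks) }
    var slotsPerChunk: Int { slots / chunks }

    /// Number of slots a clip of `seconds` would occupy, capped at the buffer size.
    func slotsNeeded(forSeconds seconds: Double) -> Int {
        let raw = Int((max(0, seconds) * slotsPerSecond).rounded(.up))
        return min(raw, slots)
    }
}

let budget = AudioBudget()
print(budget.slotsPerSecond)               // 25.0
print(budget.secondsPerChunk)              // 2.0
print(budget.slotsNeeded(forSeconds: 10))  // 250
```

Under these assumptions a 10 s clip would fill a third of the buffer, and anything longer than about 30 s would not fit.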

The decoder receives its audio embeddings through a single static-input buffer (audio_embeds [750,2048]). The <|AUDIO|> placeholders in the prompt carry extension ids of the form vocab + slot, which the graph uses to gather the matching rows from that buffer. TMRoPE collapses to 1-D for audio+text input, so positions are engine-native and no rope-shift inputs are needed. iPhone uses the AOT decoder so that the 3.9 GB graph avoids a jetsam kill during on-device JIT compilation. The AOT weights are mmapped as clean pages, so the model loads comfortably (≈5.9 GB headroom after load on a 12 GB device, with the increased-memory-limit entitlement).
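
One way to picture the extension-id scheme is the toy gather below. It is a sketch of the idea only, not the converted graph: it assumes that ids below the vocabulary size index the ordinary token table and that id `vocabSize + slot` selects row `slot` of the audio buffer. `EmbeddingGather` and its members are hypothetical names, and the sizes are shrunk so the example runs instantly.

```swift
import Foundation

/// Hypothetical sketch of the "extension id" scheme: ids below `vocabSize` are
/// ordinary tokens; id `vocabSize + slot` selects row `slot` of the audio buffer.
struct EmbeddingGather {
    let vocabSize: Int
    let tokenTable: [[Float]]    // [vocabSize][hidden], stands in for the token embedding table
    let audioEmbeds: [[Float]]   // [slots][hidden], stands in for audio_embeds [750,2048]

    /// Expand one audio placeholder run into extension ids, one per slot used.
    func audioPlaceholderIds(slotCount: Int) -> [Int] {
        precondition(slotCount <= audioEmbeds.count, "clip exceeds the static buffer")
        return (0..<slotCount).map { vocabSize + $0 }
    }

    /// Gather the input embedding for each id from the token table or the audio buffer.
    func gather(_ ids: [Int]) -> [[Float]] {
        ids.map { id in
            id < vocabSize ? tokenTable[id] : audioEmbeds[id - vocabSize]
        }
    }
}

// Toy sizes: 4 text tokens, 3 audio slots, hidden size 2.
let g = EmbeddingGather(
    vocabSize: 4,
    tokenTable: (0..<4).map { [Float($0), 0] },
    audioEmbeds: (0..<3).map { [100 + Float($0), 1] }
)
let ids = [1, 2] + g.audioPlaceholderIds(slotCount: 3) + [3]
let embeds = g.gather(ids)
print(ids)  // [1, 2, 4, 5, 6, 3]
```

Because the audio rows live in a fixed-shape input, the graph itself never changes shape between clips; only the buffer contents and the placeholder ids do.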

Use it

Use the coreai-audio app (record from the mic, choose a file, or play a demo clip, then ask "what do you hear?"), or call CoreAIKit directly:

let model = try await KitAudioModel(model: .qwen2_5Omni3B)   // downloads decoder + encoder
try await model.attach(samples: pcm16kMono)                  // mel → encoder → static buffer
let answer = try await LanguageModelSession(model: model).respond(to: "What do you hear?")

The 16 kHz log-mel front end is the Whisper-large-v3 one, implemented with Accelerate/vDSP, and is bit-exact with the HF feature extractor (gated at cosine similarity 1.0). Any clip is decoded to 16 kHz mono and limited to about 30 s.
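
If you prepare the samples yourself instead of letting the app decode the clip, the step before the mel front end could look roughly like this. It is a minimal sketch under stated assumptions: the input is interleaved Int16 PCM that is already sampled at 16 kHz (resampling is out of scope), the samples are wanted as `[Float]`, and clips longer than 30 s are simply cut. `prepareClip` is a hypothetical helper, not a CoreAIKit function, and the source does not say how the kit itself handles long clips.

```swift
import Foundation

/// Hypothetical pre-processing helper; not part of CoreAIKit.
/// Assumes the input is already sampled at 16 kHz (resampling is out of scope here).
func prepareClip(interleaved: [Int16], channels: Int,
                 sampleRate: Int = 16_000, maxSeconds: Int = 30) -> [Float] {
    precondition(channels > 0, "need at least one channel")
    // Keep at most maxSeconds of audio, counted in frames (one frame = one sample per channel).
    let frames = min(interleaved.count / channels, sampleRate * maxSeconds)
    var mono = [Float](repeating: 0, count: frames)
    for f in 0..<frames {
        var sum: Float = 0
        for c in 0..<channels {
            sum += Float(interleaved[f * channels + c])
        }
        // Average the channels, then scale the Int16 range to roughly [-1, 1).
        mono[f] = sum / Float(channels) / 32768.0
    }
    return mono
}

let stereo: [Int16] = [16384, 16384, -32768, 0]
print(prepareClip(interleaved: stereo, channels: 2))  // [0.5, -0.5]
```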

Conversion / numerics

Conversion code and gates live in conversion (export_qwen2_5_omni_thinker.py / export_qwen2_5_omni_audio.py). The int8lin decoder gates as top-1-exact against the fp32 HF oracle; the reworked static encoder scores cosine 1.0 against eager execution (0.99999 on GPU); the Swift vDSP mel scores cosine 1.0 against the HF extractor. Greedy decoding on iPhone matches the Mac output in content (white noise → "I hear a loud hissing sound.").
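
The two kinds of gate mentioned here, cosine similarity and top-1 exactness, can be sketched in a few lines. This is an illustration of the metrics only, not the repository's gate code (which lives in the Python export scripts); the function names and the toy logits are hypothetical.

```swift
import Foundation

/// Cosine similarity between two equally sized vectors, accumulated in Double.
/// Returns NaN if either vector is all zeros.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match and be non-empty")
    var dot = 0.0, na = 0.0, nb = 0.0
    for (x, y) in zip(a, b) {
        dot += Double(x) * Double(y)
        na += Double(x) * Double(x)
        nb += Double(y) * Double(y)
    }
    return dot / (na.squareRoot() * nb.squareRoot())
}

/// Index of the largest logit (the greedy top-1 token). Requires non-empty logits.
func top1(_ logits: [Float]) -> Int {
    precondition(!logits.isEmpty, "logits must be non-empty")
    return logits.indices.max(by: { logits[$0] < logits[$1] })!
}

/// Hypothetical gate: every step must pick the same top-1 token as the reference.
func top1Exact(reference: [[Float]], candidate: [[Float]]) -> Bool {
    reference.count == candidate.count &&
        zip(reference, candidate).allSatisfy { pair in top1(pair.0) == top1(pair.1) }
}

// Toy logits: the "quantized" values drift slightly but keep the same argmax per step.
let ref: [[Float]] = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
let quant: [[Float]] = [[0.12, 1.9, -1.1], [2.8, 0.6, 0.1]]
print(top1Exact(reference: ref, candidate: quant))  // true
print(cosineSimilarity(ref[0], ref[0]) > 0.99999)   // true
```

A top-1 gate is the stricter check for a quantized decoder under greedy decoding, since a high cosine score can still hide a flipped argmax.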

License

Apache-2.0 (inherits Qwen2.5-Omni-3B). A community conversion — not affiliated with Alibaba or Apple.


Model tree for mlboydaisuke/Qwen2.5-Omni-3B-Audio-CoreAI: base model Qwen/Qwen2.5-Omni-3B.