Qwen3 ASR 0.6B Encoder - LiteRT (INT8)

Qwen3-ASR audio encoder (zh / yue / en). INT8 weight-only.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (a C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

This is the audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22 Chinese dialects) and 30 additional languages. It is exported to LiteRT for Android. The text decoder is a Qwen3-0.6B LLM and is intended to run through LiteRT-LM as a separate runtime.

Model

Property        Value
Component       Audio encoder only
Parameters      ~180 M (encoder); the decoder is a separate 0.6B LLM
Format          LiteRT (TFLite)
Quantization    INT8 dynamic weights (fp32 activations)
Sample rate     16 000 Hz
Input           128-bin log mel, 1000 frames (10 s, fixed)
Output          125 audio embedding tokens, 1024-dim each
Languages       30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …)
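Because the input is fixed at 1000 frames, longer recordings have to be cut into 10 s windows and shorter ones padded. The sketch below works out the bookkeeping implied by the table: with the 160-sample hop listed under Audio preprocessing, one window is 160 000 samples and each of the 125 output tokens covers 1 280 samples (80 ms). The function name `plan_windows` and the choice of non-overlapping windows are illustrative assumptions; the card does not say how long audio should be segmented.

```python
import math

SAMPLE_RATE = 16_000
HOP_LENGTH = 160           # from the Audio preprocessing section
FRAMES_PER_WINDOW = 1000   # fixed encoder input length
TOKENS_PER_WINDOW = 125    # fixed encoder output length

SAMPLES_PER_WINDOW = FRAMES_PER_WINDOW * HOP_LENGTH          # 160 000 samples = 10 s
SAMPLES_PER_TOKEN = SAMPLES_PER_WINDOW // TOKENS_PER_WINDOW  # 1 280 samples = 80 ms


def plan_windows(num_samples):
    """Split a recording into non-overlapping 10 s windows (assumed strategy).

    Returns a list of (start_sample, end_sample, valid_tokens) tuples, where
    valid_tokens is the number of output tokens that overlap real audio
    rather than padding.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    windows = []
    for start in range(0, num_samples, SAMPLES_PER_WINDOW):
        end = min(start + SAMPLES_PER_WINDOW, num_samples)
        valid_tokens = math.ceil((end - start) / SAMPLES_PER_TOKEN)
        windows.append((start, end, valid_tokens))
    return windows
```

For a 15 s clip (240 000 samples) this yields two windows; the second holds 5 s of audio, so only its first 63 tokens overlap real signal and the rest correspond to padding.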

Files

File                        Size       Description
qwen3-asr-encoder.tflite    180.5 MB   Audio encoder, INT8
config.json                 1 KB       Architecture + I/O specs

Signature

Inputs:
  mel               [1, 128, 1000]   float32   10 s log mel spectrogram

Outputs:
  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
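As a rough illustration of the signature above, the following sketch runs the encoder from Python with the LiteRT interpreter. It assumes the `ai-edge-litert` package is installed and that the model exposes exactly one input and one output tensor, as listed; `check_mel` and `encode` are hypothetical helper names, not part of this bundle.

```python
import numpy as np

MEL_SHAPE = (1, 128, 1000)        # input shape from the signature
EMBEDDING_SHAPE = (1, 125, 1024)  # output shape from the signature


def check_mel(mel):
    """Validate the input shape and return a contiguous float32 array."""
    mel = np.asarray(mel)
    if mel.shape != MEL_SHAPE:
        raise ValueError(f"expected mel of shape {MEL_SHAPE}, got {mel.shape}")
    return np.ascontiguousarray(mel, dtype=np.float32)


def encode(mel, model_path="qwen3-asr-encoder.tflite"):
    """Run one 10 s window through the encoder and return the embeddings."""
    # Imported lazily so check_mel stays usable without the runtime installed.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    # Assumption: a single input and a single output tensor, both at index 0.
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]
    interpreter.set_tensor(input_info["index"], check_mel(mel))
    interpreter.invoke()
    return interpreter.get_tensor(output_info["index"])
```

On success, `encode` should return a float32 array matching `EMBEDDING_SHAPE`. If only TensorFlow is available, `tf.lite.Interpreter` offers the same methods.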

Architecture

mel [1, 128, 1000]
  └── 3× Conv2d(stride=2) + GELU          → [1, 480, 16, 125]
  └── reshape → Linear(7680→896)          → [1, 125, 896]
  └── + sinusoidal pos embed
  └── 18× pre-norm Transformer            → [1, 125, 896]
  └── LayerNorm → Linear(896) → GELU
  └── Linear(896→1024)                    → [1, 125, 1024]
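The shapes in the diagram can be checked with a few lines of arithmetic. The sketch below assumes each convolution uses a 3×3 kernel with padding 1, which the card does not state; only the stride, the channel count, and the resulting shapes come from the diagram.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one conv dimension (kernel and padding are assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_front_end(n_mels=128, n_frames=1000, channels=480, n_convs=3):
    """Follow the mel input through the conv stack and the reshape."""
    freq, time = n_mels, n_frames
    for _ in range(n_convs):
        freq, time = conv_out(freq), conv_out(time)
    conv_shape = (1, channels, freq, time)
    # The reshape folds channels and frequency into one feature axis per time step.
    flat_shape = (1, time, channels * freq)
    return conv_shape, flat_shape


conv_shape, flat_shape = trace_front_end()
print(conv_shape)  # (1, 480, 16, 125)
print(flat_shape)  # (1, 125, 7680)
```

Three stride-2 stages divide both axes by 8, so 128 mel bins become 16 and 1000 frames become 125. The flattened width 480 × 16 = 7680 matches the input size of the first Linear layer.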

Why encoder only

The text decoder is a full Qwen3-0.6B language model with GQA, RoPE, SwiGLU and RMSNorm. It doesn't fit cleanly into a single .tflite; the right runtime for LLM decoders on Android is LiteRT-LM or a comparable LLM executor, with the audio embeddings from this encoder wired in as cross-attention context.

For ASR-only (no LLM), pair this encoder with a CTC or transducer head fine-tuned on your target languages.
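To make the CTC option concrete: a CTC head is typically a linear projection from the 1024-dim embeddings to vocabulary logits, followed at inference time by a decoding step. The sketch below shows only greedy CTC decoding over per-token logits. The head itself, the vocabulary, and the blank index are hypothetical; nothing of the kind ships in this bundle, and the head would have to be trained first.

```python
import numpy as np


def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decoding for one utterance.

    logits: array of shape [T, V], e.g. T = 125 encoder tokens and V the
    size of a hypothetical vocabulary. Picks the best label per step, then
    collapses consecutive repeats and drops blanks.
    """
    best = np.asarray(logits).argmax(axis=-1)
    decoded = []
    previous = None
    for label in best.tolist():
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded
```

A blank between two identical labels keeps them separate, so the per-step sequence `[0, 1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]`.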

Audio preprocessing

  • 16 kHz mono, float32
  • 128 log mel bins
  • n_fft=400, hop_length=160, win_length=400, pad_mode="reflect"
  • log mel, mean/std normalization per utterance

The exact reference is in the upstream Qwen3-ASR tokenizer config.
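The parameters above can be assembled into a NumPy front end, sketched below. The listed values (16 kHz, n_fft=400, hop_length=160, reflect padding, 128 mel bins, per-utterance mean/std normalization) come from this card. Everything else is an assumption: the periodic Hann window, the Slaney-style mel scale with unnormalized triangular filters, the natural log with a 1e-10 floor, normalization over all bins and frames at once, and zero-padding the waveform to 10 s before analysis. It will not be bit-exact with the upstream pipeline, which remains the reference.

```python
import numpy as np

SAMPLE_RATE = 16_000
N_FFT = 400        # also the window length
HOP = 160
N_MELS = 128
N_FRAMES = 1000    # fixed encoder input: 10 s at 100 frames per second


def hz_to_mel(f):
    """Slaney-style mel scale (assumed): linear below 1 kHz, log above."""
    f = np.asarray(f, dtype=np.float64)
    log_part = 15.0 + np.log(np.maximum(f, 1000.0) / 1000.0) / (np.log(6.4) / 27.0)
    return np.where(f < 1000.0, f * 3.0 / 200.0, log_part)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    m = np.asarray(m, dtype=np.float64)
    log_part = 1000.0 * np.exp((np.log(6.4) / 27.0) * (m - 15.0))
    return np.where(m < 15.0, m * 200.0 / 3.0, log_part)


def mel_filterbank():
    """Unnormalized triangular filters, shape [N_MELS, N_FFT // 2 + 1]."""
    fft_freqs = np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1)
    mel_max = float(hz_to_mel(SAMPLE_RATE / 2))
    edges = mel_to_hz(np.linspace(0.0, mel_max, N_MELS + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel(audio):
    """Turn 16 kHz mono audio into a [1, 128, 1000] float32 feature tensor."""
    audio = np.asarray(audio, dtype=np.float32)
    target = N_FRAMES * HOP
    # Trim or zero-pad the waveform to exactly 10 s (assumed padding policy).
    audio = np.pad(audio[:target], (0, max(0, target - len(audio))))
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann (assumed)
    index = np.arange(N_FFT)[None, :] + HOP * np.arange(N_FRAMES)[:, None]
    power = np.abs(np.fft.rfft(padded[index] * window, axis=1)) ** 2  # [1000, 201]
    mel = mel_filterbank() @ power.T                                  # [128, 1000]
    features = np.log(np.maximum(mel, 1e-10))
    # Per-utterance normalization over all bins and frames (assumed scope).
    features = (features - features.mean()) / (features.std() + 1e-5)
    return features[None].astype(np.float32)
```

The result has the shape the encoder expects, so it can be passed to the interpreter directly once the assumptions have been checked against the upstream configuration.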

Source

Upstream: Qwen/Qwen3-ASR-0.6B (Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

Links

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

License

This bundle inherits the upstream model license (apache-2.0). See the linked base_model repository for the full terms.
