MOSS-Audio-4B-Instruct -- GGUF (ggml-quantised)

GGUF / ggml conversions of OpenMOSS-Team/MOSS-Audio-4B-Instruct for use with crispasr --backend moss-audio from CrispStrobe/CrispASR.

MOSS-Audio-4B-Instruct is OpenMOSS's ~4.6 B parameter audio-understanding model:

  • First audio-understanding backend in CrispASR -- not just ASR but also audio QA, scene description, music analysis, meeting summarisation
  • Mandarin + English speech recognition and audio understanding
  • DeepStack cross-layer feature injection -- multi-resolution encoder taps at layers 8/16/24 injected into the LM's early layers for fine-grained prosody + semantic awareness
  • Time-aware ASR with explicit time-marker tokens for word-level and sentence-level timestamps
  • Apache-2.0 licence
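
The DeepStack bullet above can be pictured with a toy example. This is only a sketch in pure Python with made-up sizes and values: the helper name `inject_deepstack` is hypothetical, the pairing of encoder tap 8/16/24 with LM layer 0/1/2 in that order is an assumption, and the transformer layers themselves are left out. It shows nothing more than the residual add at audio-token positions.

```python
# Toy sketch of DeepStack-style injection (hypothetical helper, not CrispASR code).
# Width 2 stands in for the real 2560-dim LM hidden state.

def inject_deepstack(hidden, audio_positions, tap_features):
    """Add one projected encoder-tap row to each audio-token position.

    hidden          : list of rows (one per token), each a list of floats
    audio_positions : indices in `hidden` occupied by audio tokens
    tap_features    : one row per audio token, already projected to LM width
    """
    assert len(audio_positions) == len(tap_features)
    out = [row[:] for row in hidden]  # copy so the input stays untouched
    for pos, feat in zip(audio_positions, tap_features):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


# 5 tokens; tokens 1..3 are audio placeholders, 0 and 4 are text.
hidden = [[0.0, 0.0] for _ in range(5)]
audio_positions = [1, 2, 3]

# One projected tap per injected LM layer (values are invented).
taps = {
    0: [[1.0, 1.0]] * 3,    # assumed: from encoder layer 8
    1: [[0.5, 0.5]] * 3,    # assumed: from encoder layer 16
    2: [[0.25, 0.25]] * 3,  # assumed: from encoder layer 24
}

for lm_layer in range(3):
    # A real model would also run LM layer `lm_layer` here.
    hidden = inject_deepstack(hidden, audio_positions, taps[lm_layer])

print(hidden)
```

Text positions are never touched, so only the audio rows accumulate the three taps.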

Files

File                              Size     Notes
moss-audio-4b-instruct-f16.gguf   9.73 GB  F16, unquantised
moss-audio-4b-instruct-q4_k.gguf  2.75 GB  Q4_K -- recommended default
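
As a rough way to compare the two files, the sizes can be turned into average bits per weight. The sketch below assumes the ~4.6 B parameter count stated on this card and decimal gigabytes (1 GB = 1e9 bytes). Both are approximations, so the results are only indicative.

```python
# Rough average bits per weight from file size (assumptions: ~4.6e9 parameters,
# decimal GB, metadata and vocabulary overhead ignored).

def bits_per_weight(file_size_gb, n_params):
    """Average number of bits stored per parameter."""
    return file_size_gb * 1e9 * 8 / n_params


N_PARAMS = 4.6e9  # approximate total from this card

for name, size_gb in [("f16", 9.73), ("q4_k", 2.75)]:
    print(f"{name}: {bits_per_weight(size_gb, N_PARAMS):.2f} bits/weight")
```

A figure slightly above 16 for the F16 file is plausible because norms and biases are stored as F32 and the parameter count is rounded.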

Quick Start

# 1. Build the runtime
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF
cmake --build build -j$(nproc) --target crispasr-cli

# 2. Download a quantisation
huggingface-cli download cstr/MOSS-Audio-4B-Instruct-GGUF \
    moss-audio-4b-instruct-q4_k.gguf --local-dir .

# 3. Transcribe audio
./build/bin/crispasr \
    -m moss-audio-4b-instruct-q4_k.gguf \
    -f your-audio.wav \
    --backend moss-audio -t 4

# 4. Audio understanding (custom prompt)
./build/bin/crispasr \
    -m moss-audio-4b-instruct-q4_k.gguf \
    -f your-audio.wav \
    --backend moss-audio \
    --prompt "Describe the sounds in this audio clip."
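
For scripting, the same invocations can be assembled from Python. This is a minimal sketch that uses only the flags shown above (`-m`, `-f`, `--backend`, `-t`, `--prompt`). The helper names are hypothetical, and it assumes the binary prints its result to stdout, which this card does not state.

```python
import shlex
import subprocess


def build_command(model, audio, prompt=None, threads=None,
                  binary="./build/bin/crispasr"):
    """Assemble a crispasr command line from the flags used in Quick Start."""
    cmd = [binary, "-m", model, "-f", audio, "--backend", "moss-audio"]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd


def run(cmd):
    """Run the command and return stdout (assumed to hold the model output)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Print the command instead of executing it, so this runs without the binary.
cmd = build_command(
    "moss-audio-4b-instruct-q4_k.gguf",
    "your-audio.wav",
    prompt="Describe the sounds in this audio clip.",
)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.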

Verified end-to-end output

JFK sample (samples/jfk.wav, 11s):

And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

Verified on Q4_K (3.8 GB, F16 encoder + Q4_K LLM). All 6 crispasr-diff stages pass with cosine similarity >= 0.999.
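
To make the 0.999 threshold concrete, here is a minimal cosine-similarity check in pure Python. It is not the crispasr-diff implementation. The vectors are invented, and the idea that each stage compares an intermediate tensor against a reference output is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


THRESHOLD = 0.999  # pass criterion quoted on this card

# Invented values: a reference activation and a slightly perturbed copy.
reference = [0.12, -0.50, 0.33, 0.90]
candidate = [0.121, -0.498, 0.331, 0.899]

cos = cosine_similarity(reference, candidate)
print("PASS" if cos >= THRESHOLD else "FAIL")
```

Cosine similarity ignores overall scale, so it flags changes in direction, such as a mis-mapped tensor, rather than small uniform rounding.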

Architecture

Component            Details
Audio encoder        32-layer Whisper-style transformer (d=1280, 20 heads, head_dim=64, FFN=5120, GELU, LayerNorm, eps=1e-5)
Conv stem            3x Conv2d(stride=2, channels=480, kernel=3x3, pad=1) -> 8x temporal downsample (128 mel bins -> 16 freq bins)
Stem projection      Linear(480x16=7680 -> 1280) + sinusoidal positional embedding (max 1500 positions)
DeepStack taps       Encoder layers [8, 16, 24] -> 3 independent GatedMLP(1280 -> 8192 -> 2560, SiLU)
DeepStack injection  Residual add at LM layers [0, 1, 2] at audio-token positions
Audio adapter        GatedMLP(1280 -> 8192 -> 2560, SiLU) for final encoder output
LM backbone          36-layer Qwen3 (hidden=2560, 32 Q-heads / 8 KV-heads, head_dim=128, QK-norm, SwiGLU FFN=9728, RoPE theta=1M)
Output head          Linear(2560 -> 151936), untied from embedding
Vocab                151936 Qwen3 BPE (151643 regular + 293 special tokens)
Audio input          16 kHz mono, 128 mel bins, n_fft=400, hop=160
Audio tokens         12.5 Hz after 8x conv downsample, time markers every 2 seconds
Parameters           ~4.6 B total (encoder ~650M + adapter/deepstack ~120M + LM ~3.8B)
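
The numbers in the table can be cross-checked with a few lines of arithmetic. The sketch below uses the standard output-length formula for a strided convolution. It assumes the number of mel frames is simply samples // hop, with no padding or fixed-window chunking, which this card does not specify.

```python
# Shape arithmetic for the audio front end, using values from the table above.

SAMPLE_RATE = 16000
HOP = 160
N_MELS = 128
STEM_CHANNELS = 480


def conv_out(n, kernel=3, stride=2, pad=1):
    """Output length of one conv layer along one axis."""
    return (n + 2 * pad - kernel) // stride + 1


def after_stem(n, layers=3):
    """Apply the three stride-2 conv layers along one axis."""
    for _ in range(layers):
        n = conv_out(n)
    return n


mel_fps = SAMPLE_RATE / HOP             # 100.0 mel frames per second
token_rate = mel_fps / 8                # 12.5 audio tokens per second
freq_bins = after_stem(N_MELS)          # 128 -> 64 -> 32 -> 16
stem_width = STEM_CHANNELS * freq_bins  # input width of the stem projection


def audio_tokens(seconds):
    """Audio-token count for a clip (assumption: no padding or chunking)."""
    mel_frames = int(seconds * SAMPLE_RATE) // HOP
    return after_stem(mel_frames)


print(token_rate, freq_bins, stem_width)
print(audio_tokens(11))  # the 11 s JFK sample
print(20 * 64 == 1280, 151643 + 293 == 151936)
```

Under these assumptions the 11 s sample maps to 1100 mel frames and 138 audio tokens (1100 -> 550 -> 275 -> 138).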

Special tokens

Token          ID      Purpose
<|AUDIO|>      151654  Audio frame placeholder (replaced by encoder embeddings)
<|audio_bos|>  151669  Audio segment start marker
<|audio_eos|>  151670  Audio segment end marker
<|im_start|>   151644  Chat turn start
<|im_end|>     151645  Chat turn end / EOS
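
To illustrate how these IDs fit together, the sketch below builds the token IDs for one audio segment and finds the placeholder positions that the encoder embeddings would replace. The helper names are hypothetical. Wrapping the segment directly in `<|im_start|>` / `<|im_end|>` is an assumption, and role tokens, prompt text and time-marker tokens are omitted because the exact chat template is not given here.

```python
# Special-token IDs copied from the table above.
SPECIAL = {
    "<|AUDIO|>": 151654,
    "<|audio_bos|>": 151669,
    "<|audio_eos|>": 151670,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
}


def audio_span_ids(n_audio_tokens):
    """IDs for one audio segment: bos, n placeholders, eos."""
    return (
        [SPECIAL["<|audio_bos|>"]]
        + [SPECIAL["<|AUDIO|>"]] * n_audio_tokens
        + [SPECIAL["<|audio_eos|>"]]
    )


def placeholder_positions(ids):
    """Indices whose embeddings would be replaced by encoder output."""
    return [i for i, tok in enumerate(ids) if tok == SPECIAL["<|AUDIO|>"]]


# Assumed layout: a bare chat turn around the audio segment (text omitted).
ids = [SPECIAL["<|im_start|>"]] + audio_span_ids(4) + [SPECIAL["<|im_end|>"]]

print(ids)
print(placeholder_positions(ids))
```

The same position list is what a DeepStack-style injection would use to decide where to add the projected encoder taps.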

How this was made

  1. Inspect the HF safetensors: 3 shards, 901 tensors total -- audio encoder (conv stem + 32 transformer layers + layer_norm), audio adapter (1 GatedMLP), deepstack mergers (3 GatedMLPs), language model (embedding + 36 Qwen3 layers + final norm + lm_head).

  2. Convert with models/convert-moss-audio-to-gguf.py: stream the BF16 tensors one at a time via safe_open, remap the HF tensor names (audio_encoder.layers.N.self_attn.q_proj -> enc.blk.N.attn.q, deepstack_audio_merger_list.N.gate_proj -> deepstack.N.gate, language_model.layers.N.mlp.gate_proj -> llm.blk.N.ffn.gate, etc.), and write weights as F16 and norms/biases as F32. The BPE vocabulary and merges come from vocab.json, merges.txt and added_tokens.json.

  3. Quantise with crispasr-quantize: F16 -> Q4_K (tensors with 2 or more dimensions are quantised; 1D biases/norms are kept as F32).

  4. C++ runtime in src/moss_audio.{h,cpp}: GGUF mmap, encoder graph (conv stem + 32 WhisperEncoderLayers with bidirectional attention + DeepStack tap capture), adapter/merger GatedMLP graphs, per-layer DeepStack injection into LM via pre-scattered residuals, KV-cached Qwen3 decode with core_attn::kv_self_attn (QK-norm, RoPE, GQA), greedy decode with chat-template prompt builder.
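
The name remapping in step 2 and the type rule in step 3 can be sketched as follows. This is not the actual conversion script: it covers only the three mappings quoted above, it assumes suffixes such as `.weight` pass through unchanged, and the function names are hypothetical.

```python
import re

# The three example mappings from step 2; the real script handles more.
RULES = [
    (re.compile(r"^audio_encoder\.layers\.(\d+)\.self_attn\.q_proj"),
     r"enc.blk.\1.attn.q"),
    (re.compile(r"^deepstack_audio_merger_list\.(\d+)\.gate_proj"),
     r"deepstack.\1.gate"),
    (re.compile(r"^language_model\.layers\.(\d+)\.mlp\.gate_proj"),
     r"llm.blk.\1.ffn.gate"),
]


def remap(name):
    """Rewrite a HF tensor name to its GGUF name; unknown names pass through."""
    for pattern, replacement in RULES:
        new_name, count = pattern.subn(replacement, name)
        if count:
            return new_name
    return name


def target_type(shape):
    """Step 3 rule: quantise tensors with 2+ dims, keep 1D tensors as F32."""
    return "Q4_K" if len(shape) >= 2 else "F32"


for hf_name, shape in [
    ("audio_encoder.layers.3.self_attn.q_proj.weight", (1280, 1280)),
    ("audio_encoder.layers.3.self_attn.q_proj.bias", (1280,)),
    ("language_model.layers.35.mlp.gate_proj.weight", (9728, 2560)),
]:
    print(remap(hf_name), target_type(shape))
```

Anchoring each pattern at the start of the name keeps one rule from matching inside a longer, unrelated tensor name.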

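Upstream

  • Original model: OpenMOSS-Team/MOSS-Audio-4B-Instruct
  • Runtime: CrispStrobe/CrispASR (https://github.com/CrispStrobe/CrispASR)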
