Qwen3-VL-2B-Instruct GGUF (CrispEmbed format)

GGUF conversion of Qwen/Qwen3-VL-2B-Instruct for use with the CrispEmbed inference engine.

Files

File	Size	Description
`qwen3-vl-2b-f16.gguf`	4.6 GB	Unquantized half precision (FP16)
`qwen3-vl-2b-q8_0.gguf`	2.2 GB	8-bit quantization (2.1x compression)
`qwen3-vl-2b-q4_k.gguf`	1.5 GB	4-bit quantization (3.1x compression)
`qwen3-vl-2b-diff-ref.gguf`	38 MB	Reference activations for parity testing
`test_small.png`	197 KB	Test image (256x256, random seed 42)
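
The compression figures in the table can be reproduced from the listed sizes. The sketch below is only a sanity check on the arithmetic: the sizes are copied from the table, and the function name `compression_ratio` is a hypothetical helper, not part of CrispEmbed.

```python
# File sizes in GB, copied from the table above.
SIZES_GB = {
    "qwen3-vl-2b-f16.gguf": 4.6,
    "qwen3-vl-2b-q8_0.gguf": 2.2,
    "qwen3-vl-2b-q4_k.gguf": 1.5,
}


def compression_ratio(baseline_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized file is than the baseline."""
    return baseline_gb / quantized_gb


if __name__ == "__main__":
    baseline = SIZES_GB["qwen3-vl-2b-f16.gguf"]
    for name, size in SIZES_GB.items():
        # Ratios are relative to the FP16 file, rounded to one decimal place.
        print(f"{name}: {compression_ratio(baseline, size):.1f}x")
```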

Architecture

Qwen3-VL-2B is a vision-language model with:

Vision encoder: 24-layer ViT (1024d, patch_size=16, learned position embeddings with bilinear interpolation + 2D RoPE)
DeepStack: Intermediate vision features from layers 5, 11, 17 injected into LLM layers 0-2
LLM decoder: 28-layer Qwen3 (2048d, 16 heads, 8 KV heads, interleaved mRoPE, QK RMSNorm)
Tokenizer: GPT-2 BPE (151,669 tokens)
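
To make the numbers above concrete, here is a minimal sketch that encodes them as plain Python dataclasses and derives a few shapes. The class and function names are hypothetical and not part of CrispEmbed. Two things are assumptions rather than statements from this card: the 2x2 spatial merge in the vision merger, and the one-to-one ordering of DeepStack sources (5, 11, 17) onto LLM layers (0, 1, 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    layers: int = 24
    hidden: int = 1024
    patch_size: int = 16
    deepstack_layers: tuple = (5, 11, 17)
    merge_size: int = 2  # ASSUMPTION: 2x2 spatial merge, not stated in this card


@dataclass(frozen=True)
class DecoderConfig:
    layers: int = 28
    hidden: int = 2048
    heads: int = 16
    kv_heads: int = 8


def patch_grid(width: int, height: int, cfg: VisionConfig) -> tuple:
    """Number of (rows, cols) of patches for an image of the given size."""
    return (height // cfg.patch_size, width // cfg.patch_size)


def vision_tokens(width: int, height: int, cfg: VisionConfig) -> int:
    """Tokens handed to the LLM after the (assumed) spatial merge."""
    rows, cols = patch_grid(width, height, cfg)
    return (rows // cfg.merge_size) * (cols // cfg.merge_size)


def deepstack_targets(cfg: VisionConfig) -> dict:
    """Map each DeepStack source layer to an LLM layer (ASSUMED in-order mapping)."""
    return {src: dst for dst, src in enumerate(cfg.deepstack_layers)}


def gqa_group_size(cfg: DecoderConfig) -> int:
    """Query heads that share one KV head under grouped-query attention."""
    return cfg.heads // cfg.kv_heads
```

For the 256x256 test image, `patch_grid` gives a 16x16 grid, and under the assumed 2x2 merge the LLM would see 64 vision tokens.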

Usage with CrispEmbed

# OCR
crispembed -m qwen3-vl-2b-q8_0.gguf --ocr document.png

# Parity test (crispembed-diff)
test-qwen2vl-diff qwen3-vl-2b-f16.gguf qwen3-vl-2b-diff-ref.gguf test_small.png
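
For scripted use, the OCR command above can be wrapped with `subprocess`. This is a sketch only: it uses just the `-m` and `--ocr` flags shown above, the helper names are hypothetical, and it assumes the recognized text is written to standard output, which this card does not confirm.

```python
import subprocess


def ocr_command(model_path: str, image_path: str, binary: str = "crispembed") -> list:
    """Build the argument list for the OCR invocation shown above."""
    return [binary, "-m", model_path, "--ocr", image_path]


def run_ocr(model_path: str, image_path: str) -> str:
    """Run the OCR command and return whatever it prints.

    ASSUMPTION: the OCR result is emitted on stdout.
    """
    result = subprocess.run(
        ocr_command(model_path, image_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    print(run_ocr("qwen3-vl-2b-q8_0.gguf", "document.png"))
```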

Parity Verification

Per-layer parity against a Python reference implementation (a pure NumPy forward pass). The cos_min column gives the lowest cosine similarity measured at each stage:

Stage	cos_min
Vision patch embed + bilinear pos	1.000000
Vision layers 0-23	>= 0.984
Vision merger	0.999831
DeepStack mergers (3x)	>= 0.999
LLM embed (spliced)	1.000000
LLM Q after IMROPE	1.000000
LLM layer 0	1.000000
LLM layer 1	0.999995
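
A metric like cos_min can be computed as sketched below. This is an illustration under an assumption: it treats cos_min as the minimum row-wise cosine similarity (for example, one row per token) between reference and candidate activations. The exact reduction used by the parity tool is not documented here, and the function names are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero vectors count as identical; otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def cos_min(reference, candidate) -> float:
    """Worst-case row-wise cosine similarity between two activation matrices."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same number of rows")
    return min(cosine(r, c) for r, c in zip(reference, candidate))
```

Taking the minimum rather than the mean means a single badly mismatched row is enough to pull the score down, which suits a parity check.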

Conversion

python models/convert-qwen3vl-to-gguf.py \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --output qwen3-vl-2b-f16.gguf --dtype f16

crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q8_0.gguf q8_0
crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q4_k.gguf q4_k
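
After converting or quantizing, a quick header check can confirm the output is a readable GGUF file. The sketch below reads the fixed GGUF preamble: the `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts. It assumes a little-endian file with GGUF version 2 or later, and that the CrispEmbed converter writes the standard header. `read_gguf_header` is a hypothetical helper.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path: str) -> dict:
    """Read the magic, version, tensor count, and metadata count of a GGUF file."""
    with open(path, "rb") as f:
        head = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (counts)
    if len(head) < 24 or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    # ASSUMPTION: little-endian layout with 64-bit counts (GGUF v2+).
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    print(read_gguf_header("qwen3-vl-2b-q8_0.gguf"))
```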
