Qwen3-VL-2B-Instruct GGUF (CrispEmbed format)

GGUF conversion of Qwen/Qwen3-VL-2B-Instruct for use with the CrispEmbed inference engine.

Files

File                        Size     Description
qwen3-vl-2b-f16.gguf        4.6 GB   Unquantized baseline (FP16, half precision)
qwen3-vl-2b-q8_0.gguf       2.2 GB   8-bit quantization (2.1x compression vs. FP16)
qwen3-vl-2b-q4_k.gguf       1.5 GB   4-bit quantization (3.1x compression vs. FP16)
qwen3-vl-2b-diff-ref.gguf   38 MB    Reference activations for parity testing
test_small.png              197 KB   Test image (256x256, random seed 42)
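
The compression ratios follow directly from the listed file sizes. A quick check in Python, using the rounded sizes from the table (so the ratios are approximate; the helper name is illustrative):

```python
# File sizes in GB, copied from the table above (rounded values).
sizes_gb = {
    "qwen3-vl-2b-f16.gguf": 4.6,
    "qwen3-vl-2b-q8_0.gguf": 2.2,
    "qwen3-vl-2b-q4_k.gguf": 1.5,
}


def compression_ratio(baseline_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized file is than the baseline."""
    return baseline_gb / quantized_gb


baseline = sizes_gb["qwen3-vl-2b-f16.gguf"]
ratios = {
    name: round(compression_ratio(baseline, size), 1)
    for name, size in sizes_gb.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```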

Architecture

Qwen3-VL-2B is a vision-language model with:

  • Vision encoder: 24-layer ViT (1024d, patch_size=16, learned position embeddings with bilinear interpolation + 2D RoPE)
  • DeepStack: Intermediate vision features from layers 5, 11, 17 injected into LLM layers 0-2
  • LLM decoder: 28-layer Qwen3 (2048d, 16 heads, 8 KV heads, interleaved mRoPE, QK RMSNorm)
  • Tokenizer: GPT-2 BPE (151,669 tokens)
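
To make the numbers above concrete, here is a small sketch that derives a few quantities from them: the patch grid for the 256x256 test image and the grouped-query attention ratio of the decoder. The class and function names are hypothetical, and the 2x2 spatial merge factor is an assumption taken from the upstream Qwen3-VL configuration, not from this card. The sketch also assumes the image dimensions are already multiples of the patch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    layers: int = 24
    hidden: int = 1024
    patch_size: int = 16
    spatial_merge: int = 2  # ASSUMPTION: not stated in this card


@dataclass(frozen=True)
class DecoderConfig:
    layers: int = 28
    hidden: int = 2048
    heads: int = 16
    kv_heads: int = 8


def patch_grid(height: int, width: int, cfg: VisionConfig) -> tuple[int, int]:
    """Number of patches along each axis for an already-resized image."""
    if height % cfg.patch_size or width % cfg.patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return height // cfg.patch_size, width // cfg.patch_size


def merged_token_count(height: int, width: int, cfg: VisionConfig) -> int:
    """Vision tokens left after merging spatial_merge x spatial_merge patches."""
    grid_h, grid_w = patch_grid(height, width, cfg)
    return (grid_h // cfg.spatial_merge) * (grid_w // cfg.spatial_merge)


def queries_per_kv_head(cfg: DecoderConfig) -> int:
    """How many query heads share one key/value head (grouped-query attention)."""
    return cfg.heads // cfg.kv_heads


vision = VisionConfig()
decoder = DecoderConfig()

print(patch_grid(256, 256, vision))
print(merged_token_count(256, 256, vision))
print(queries_per_kv_head(decoder))
```

Under these assumptions, the test image yields a 16x16 patch grid, and each key/value head in the decoder serves two query heads.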

Usage with CrispEmbed

# OCR
crispembed -m qwen3-vl-2b-q8_0.gguf --ocr document.png

# Parity test (crispembed-diff)
test-qwen2vl-diff qwen3-vl-2b-f16.gguf qwen3-vl-2b-diff-ref.gguf test_small.png

Parity Verification

Full per-layer parity against a Python reference (pure NumPy forward pass). cos_min is the minimum cosine similarity against the reference activations at each stage:

Stage                                  cos_min
Vision patch embed + bilinear pos      1.000000
Vision layers 0-23                     >= 0.984
Vision merger                          0.999831
DeepStack mergers (3x)                 >= 0.999
LLM embed (spliced)                    1.000000
LLM Q after IMROPE                     1.000000
LLM layer 0                            1.000000
LLM layer 1                            0.999995
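
A minimal sketch of how a cos_min value can be computed, assuming activations are compared row by row (for example, one row per token) and the smallest similarity is reported. The function names are hypothetical and are not taken from CrispEmbed or test-qwen2vl-diff; the actual tool may aggregate differently. It uses only the standard library so it runs anywhere.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero vectors as identical, otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def cos_min(
    reference: Sequence[Sequence[float]],
    candidate: Sequence[Sequence[float]],
) -> float:
    """Smallest row-wise cosine similarity between two activation matrices."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and have the same row count")
    return min(cosine_similarity(r, c) for r, c in zip(reference, candidate))


reference = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
candidate = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.001]]
print(f"cos_min = {cos_min(reference, candidate):.6f}")
```

Taking the minimum rather than the mean means a single badly diverging row is enough to lower the reported value.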

Conversion

python models/convert-qwen3vl-to-gguf.py \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --output qwen3-vl-2b-f16.gguf --dtype f16

crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q8_0.gguf q8_0
crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q4_k.gguf q4_k
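
After converting or quantizing, a quick sanity check is to read the fixed-size GGUF header of the output file. The sketch below assumes a little-endian GGUF file of version 2 or later (magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). It is an independent helper with hypothetical names, not part of CrispEmbed, and it does not validate tensors or metadata.

```python
import struct
import sys
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path: str) -> GGUFHeader:
    """Read and validate the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError(f"{path}: file too short to contain a GGUF header")
    magic, version, tensor_count, metadata_kv_count = HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"{path}: bad magic {magic!r}, not a GGUF file")
    return GGUFHeader(version, tensor_count, metadata_kv_count)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_gguf.py qwen3-vl-2b-q8_0.gguf
    print(read_gguf_header(sys.argv[1]))
```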