Qwen3-VL-2B-Instruct GGUF (CrispEmbed format)
GGUF conversion of Qwen/Qwen3-VL-2B-Instruct for use with the CrispEmbed inference engine.
Files
| File | Size | Description |
|---|---|---|
| qwen3-vl-2b-f16.gguf | 4.6 GB | Full precision (FP16) |
| qwen3-vl-2b-q8_0.gguf | 2.2 GB | 8-bit quantization (2.1x compression) |
| qwen3-vl-2b-q4_k.gguf | 1.5 GB | 4-bit quantization (3.1x compression) |
| qwen3-vl-2b-diff-ref.gguf | 38 MB | Reference activations for parity testing |
| test_small.png | 197 KB | Test image (256x256, random seed 42) |
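
The compression figures quoted for the quantized files can be reproduced from the listed sizes. The short Python sketch below does that arithmetic; the dictionary and function names are illustrative and not part of CrispEmbed, and the sizes are the rounded values from the table rather than exact byte counts.

```python
# File sizes in GB, copied from the table above (rounded values).
SIZES_GB = {
    "qwen3-vl-2b-f16.gguf": 4.6,
    "qwen3-vl-2b-q8_0.gguf": 2.2,
    "qwen3-vl-2b-q4_k.gguf": 1.5,
}


def compression_ratio(baseline_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized file is than the baseline."""
    return baseline_gb / quantized_gb


baseline = SIZES_GB["qwen3-vl-2b-f16.gguf"]
ratios = {
    name: round(compression_ratio(baseline, size), 1)
    for name, size in SIZES_GB.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x relative to FP16")
```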
Architecture
Qwen3-VL-2B is a vision-language model with:
- Vision encoder: 24-layer ViT (1024d, patch_size=16, learned bilinear position embeddings + 2D RoPE)
- DeepStack: Intermediate vision features from layers 5, 11, 17 injected into LLM layers 0-2
- LLM decoder: 28-layer Qwen3 (2048d, 16 heads, 8 KV heads, interleaved mRoPE, QK RMSNorm)
- Tokenizer: GPT-2 BPE (151,669 tokens)
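
As a rough illustration of how these numbers fit together, the Python sketch below records the listed hyperparameters and derives a few quantities from them. Several things here are assumptions not stated in this README: the class and function names are hypothetical, the 2x2 spatial merge factor is assumed from the upstream Qwen-VL family, the DeepStack pairing is assumed to be in order (5 to 0, 11 to 1, 17 to 2), and the image is assumed to reach the encoder without resizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    num_layers: int = 24
    hidden_size: int = 1024
    patch_size: int = 16
    spatial_merge_size: int = 2  # ASSUMPTION: not stated in this README


@dataclass(frozen=True)
class LLMConfig:
    num_layers: int = 28
    hidden_size: int = 2048
    num_heads: int = 16
    num_kv_heads: int = 8


# ASSUMPTION: vision layers are paired with LLM layers in listed order.
DEEPSTACK_MAP = dict(zip((5, 11, 17), (0, 1, 2)))


def patch_grid(width: int, height: int, cfg: VisionConfig) -> tuple[int, int]:
    """Return (rows, cols) of patches for an image that is not resized."""
    if width % cfg.patch_size or height % cfg.patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return height // cfg.patch_size, width // cfg.patch_size


def merged_token_count(width: int, height: int, cfg: VisionConfig) -> int:
    """Return the number of vision tokens left after the spatial merger."""
    rows, cols = patch_grid(width, height, cfg)
    m = cfg.spatial_merge_size
    return (rows // m) * (cols // m)


def queries_per_kv_head(cfg: LLMConfig) -> int:
    """Return how many query heads share one KV head (grouped-query attention)."""
    return cfg.num_heads // cfg.num_kv_heads


vision, llm = VisionConfig(), LLMConfig()
print(patch_grid(256, 256, vision))          # patch grid for test_small.png
print(merged_token_count(256, 256, vision))  # tokens handed to the LLM
print(queries_per_kv_head(llm))              # query heads per KV head
```

Under these assumptions, the 256x256 test image yields a 16x16 patch grid, and each KV head serves two query heads.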
Usage with CrispEmbed
# OCR
crispembed -m qwen3-vl-2b-q8_0.gguf --ocr document.png
# Parity test (crispembed-diff)
test-qwen2vl-diff qwen3-vl-2b-f16.gguf qwen3-vl-2b-diff-ref.gguf test_small.png
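
For scripted use, the OCR command above can be wrapped from Python. This is only a sketch: it uses the two flags shown above (-m and --ocr) and nothing else, the helper names are hypothetical, and it assumes the recognized text is written to standard output, which this README does not confirm.

```python
import shutil
import subprocess


def build_ocr_command(model_path: str, image_path: str,
                      binary: str = "crispembed") -> list[str]:
    """Build the argv list for the OCR invocation shown above."""
    return [binary, "-m", model_path, "--ocr", image_path]


def run_ocr(model_path: str, image_path: str) -> str:
    """Run the OCR command and return whatever it prints to stdout.

    ASSUMPTION: the OCR result is written to stdout.
    """
    cmd = build_ocr_command(model_path, image_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(" ".join(build_ocr_command("qwen3-vl-2b-q8_0.gguf", "document.png")))
```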
Parity Verification
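Full per-layer parity against a Python reference (pure NumPy forward pass). cos_min is the minimum cosine similarity observed between CrispEmbed and reference activations at each stage: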
| Stage | cos_min |
|---|---|
| Vision patch embed + bilinear pos | 1.000000 |
| Vision layers 0-23 | >= 0.984 |
| Vision merger | 0.999831 |
| DeepStack mergers (3x) | >= 0.999 |
| LLM embed (spliced) | 1.000000 |
| LLM Q after IMROPE | 1.000000 |
| LLM layer 0 | 1.000000 |
| LLM layer 1 | 0.999995 |
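
A metric like cos_min can be computed as sketched below. This is not the code behind the numbers above: it assumes activations are compared row by row (for example one row per token) and the minimum is taken over rows, and it uses only the Python standard library so it runs without NumPy. The function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cos_min(candidate: Sequence[Sequence[float]],
            reference: Sequence[Sequence[float]]) -> float:
    """Return the worst (smallest) row-wise cosine similarity.

    ASSUMPTION: each row is one activation vector, e.g. one token.
    """
    if len(candidate) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty with matching row counts")
    return min(cosine_similarity(c, r) for c, r in zip(candidate, reference))


reference = [[1.0, 0.0], [0.0, 2.0]]
candidate = [[1.0, 0.0], [0.1, 2.0]]  # second row deviates slightly
print(f"cos_min = {cos_min(candidate, reference):.6f}")
```

Taking the minimum rather than the mean means a single badly mismatched row is enough to flag a stage.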
Conversion
python models/convert-qwen3vl-to-gguf.py \
--model Qwen/Qwen3-VL-2B-Instruct \
--output qwen3-vl-2b-f16.gguf --dtype f16
crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q8_0.gguf q8_0
crispembed-quantize qwen3-vl-2b-f16.gguf qwen3-vl-2b-q4_k.gguf q4_k
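
After converting or quantizing, a quick sanity check is to read the GGUF header of the output file. The Python sketch below assumes a little-endian file using GGUF version 2 or later, where the 4-byte magic is followed by a 32-bit version and two 64-bit counts; the function name is illustrative and not part of CrispEmbed.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path: str) -> dict[str, int]:
    """Read the fixed-size GGUF header and return its fields.

    ASSUMPTION: little-endian file, GGUF version 2 or later.
    """
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE:
        raise ValueError(f"{path}: file too short to be GGUF")
    if head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path}: bad magic {head[:4]!r}")
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    import sys

    for gguf_path in sys.argv[1:]:
        print(gguf_path, read_gguf_header(gguf_path))
```

Saved under any script name, it can be run with the three .gguf paths as arguments. A quantized file would be expected to report the same tensor count as the FP16 file it was derived from, though that depends on how the quantizer handles each tensor.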
Model tree for cstr/qwen3-vl-2b-crispembed-gguf
Base model: Qwen/Qwen3-VL-2B-Instruct