GOT-OCR2 — CrispEmbed GGUF

GGUF conversion of stepfun-ai/GOT-OCR2_0 for use with CrispEmbed.

Architecture

Vision: SAM ViT-B (12 layers, 768d, 12 heads, 16×16 patches, 1024×1024 input)
- Windowed attention (window size 14) with global attention at layers [2, 5, 8, 11] (0-indexed)
- Decomposed relative position encoding
- Neck: Conv(768→256) → LN2d → Conv(256→256) → LN2d
- Downsample: Conv(256→512, stride 2) → Conv(512→1024, stride 2), reducing the 64×64 patch grid to 16×16 = 256 vision tokens
- Projector: Linear(1024, 1024)
LLM: Qwen2-0.5B (24 layers, 1024d, MHA with 16 query / 16 KV heads, SwiGLU MLP with SiLU gate, RoPE θ=1M)
Tokenizer: tiktoken (151860 vocab)
Total: ~0.7B parameters
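
The 256-token figure follows from the shapes listed above. Here is a minimal sketch of the arithmetic, assuming each stride-2 convolution halves the feature grid exactly (the padding is not stated here) and that the global-attention layer indices are 0-based. The function names are illustrative and not part of CrispEmbed.

```python
def vision_token_count(image_size=1024, patch_size=16, stride2_convs=2):
    """Number of vision tokens handed to the projector."""
    grid = image_size // patch_size      # 1024 / 16 = 64 patches per side
    for _ in range(stride2_convs):
        grid //= 2                       # assumption: each stride-2 conv halves the grid
    return grid * grid                   # tokens = side * side


def attention_kinds(num_layers=12, global_layers=(2, 5, 8, 11)):
    """Label each ViT layer as 'global' or 'windowed' (0-based indices assumed)."""
    return ["global" if i in global_layers else "windowed"
            for i in range(num_layers)]


if __name__ == "__main__":
    print(vision_token_count())          # 64 -> 32 -> 16 per side, so 256 tokens
    print(attention_kinds())
```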

| File | Quant | Size | Notes |
|---|---|---|---|
| `got-ocr2-f16.gguf` | F16 | 1.34 GB | Unquantized (16-bit float) |
| `got-ocr2-q8_0.gguf` | Q8_0 | 569 MB | Best quantized quality |
| `got-ocr2-q4_k.gguf` | Q4_K | 422 MB | Smallest, good quality |
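
To compare the variants, it can help to express each file as a fraction of the F16 file. The sketch below uses only the sizes listed above and assumes decimal units (1 GB = 1000 MB). The dictionary and function names are illustrative.

```python
# File sizes in MB, taken from the list above (1.34 GB assumed to be 1340 MB).
SIZES_MB = {
    "got-ocr2-f16.gguf": 1340.0,
    "got-ocr2-q8_0.gguf": 569.0,
    "got-ocr2-q4_k.gguf": 422.0,
}


def relative_size(name, baseline="got-ocr2-f16.gguf"):
    """Size of a file as a fraction of the baseline file."""
    return SIZES_MB[name] / SIZES_MB[baseline]


if __name__ == "__main__":
    for name in SIZES_MB:
        print(f"{name}: {relative_size(name):.1%} of the F16 file")
```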

All checkpoints were verified at cosine similarity ≥ 0.999 against the F32 Python reference.
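
A check of this kind can be sketched as follows. Which tensors were compared is not stated here, so the sketch assumes two flattened output vectors of equal length. The function names are illustrative and are not the actual verification script.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_check(reference, candidate, threshold=0.999):
    """True if the candidate output is close enough to the reference."""
    return cosine_similarity(reference, candidate) >= threshold


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]        # stand-in for F32 reference output
    candidate = [1.01, 1.99, 3.02]     # stand-in for quantized output
    print(passes_check(reference, candidate))
```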

Usage:

crispembed --ocr got-ocr2 image.png

License: Apache-2.0 (same as the upstream model)
