GOT-OCR2: CrispEmbed GGUF

GGUF conversion of stepfun-ai/GOT-OCR2_0 for use with CrispEmbed.

Architecture

  • Vision: SAM ViT-B (12 layers, 768d, 12 heads, 16×16 patches, 1024×1024 input)
    • Windowed attention (ws=14) with global attention at layers [2, 5, 8, 11]
    • Decomposed relative position encoding
    • Neck: Conv(768→256) → LN2d → Conv(256→256) → LN2d
    • Downsample: Conv(256→512→1024, stride 2) → 256 vision tokens
    • Projector: Linear(1024, 1024)
  • LLM: Qwen2-0.5B (24 layers, 1024d, MHA 16/16, SiLU SwiGLU, RoPE θ=1M)
  • Tokenizer: tiktoken (151860 vocab)
  • Total: ~0.7B parameters
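
The token count in the list above follows from simple grid arithmetic, sketched here. The function names are hypothetical and not part of CrispEmbed. The sketch assumes that each stride-2 convolution halves the grid exactly, and that windowed attention zero-pads the patch grid up to a multiple of the window size, as the original SAM image encoder does.

```python
# Illustrative arithmetic only; names are hypothetical, not a CrispEmbed API.
IMAGE_SIZE = 1024       # input resolution from the architecture list
PATCH_SIZE = 16         # ViT patch size
WINDOW_SIZE = 14        # windowed-attention window (ws=14)
NUM_STRIDE2_CONVS = 2   # downsample stages: 256->512 and 512->1024


def vision_token_count(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE,
                       num_stride2_convs=NUM_STRIDE2_CONVS):
    """Number of vision tokens handed to the projector."""
    grid = image_size // patch_size      # 1024 / 16 = 64 patches per side
    for _ in range(num_stride2_convs):
        grid //= 2                       # assumption: each conv halves the grid exactly
    return grid * grid


def windows_per_image(grid, window_size=WINDOW_SIZE):
    """Padded grid side and window count, assuming SAM-style padding."""
    per_side = -(-grid // window_size)   # ceiling division
    padded = per_side * window_size
    return padded, per_side * per_side


if __name__ == "__main__":
    print(vision_token_count())          # 256
    print(windows_per_image(64))         # (70, 25)
```

Under these assumptions the 64×64 patch grid is padded to 70×70 for the windowed layers, and the two stride-2 stages reduce it to 16×16 = 256 tokens.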

Files

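| File                | Quant | Size    | Notes                       |
|---------------------|-------|---------|-----------------------------|
| got-ocr2-f16.gguf   | F16   | 1.34 GB | Unquantized (16-bit float)  |
| got-ocr2-q8_0.gguf  | Q8_0  | 569 MB  | Best quantized quality      |
| got-ocr2-q4_k.gguf  | Q4_K  | 422 MB  | Smallest, good quality      |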

Parity

All checkpoints were verified at cosine similarity ≥ 0.999 against the Python reference (F32):

  • Vision layers (windowed + global attention)
  • Neck, downsample, projector
  • LLM decoder layers
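
A parity check of this kind can be sketched as follows. This is a minimal illustration, not the script used to produce the results above: the function names are hypothetical, and it assumes each stage's activations have been dumped and flattened into equal-length sequences of floats.

```python
import math


def cosine_similarity(reference, candidate):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_ref * norm_cand)


def passes_parity(reference, candidate, threshold=0.999):
    """True if the candidate activations meet the stated parity threshold."""
    return cosine_similarity(reference, candidate) >= threshold


if __name__ == "__main__":
    ref = [0.12, -0.50, 0.33, 0.90]       # made-up F32 reference activations
    out = [0.12, -0.49, 0.33, 0.91]       # made-up GGUF activations
    print(passes_parity(ref, out))
```

Cosine similarity ignores overall scale, so a check like this is usually paired with a look at absolute differences when magnitudes matter.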

Usage

crispembed --ocr got-ocr2 image.png
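
For batch use, the command above can be wrapped from Python. This is a sketch under assumptions: it uses only the flags shown in the usage line, assumes the `crispembed` binary is on `PATH`, and assumes the recognized text is written to standard output, which this card does not confirm. The helper names are hypothetical.

```python
import subprocess


def build_ocr_command(image_path, model="got-ocr2", binary="crispembed"):
    """Build the argument list for the CLI call shown in the usage line."""
    return [binary, "--ocr", model, str(image_path)]


def run_ocr(image_path):
    """Run OCR on one image and return whatever the CLI prints.

    Assumption: the recognized text goes to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        build_ocr_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(build_ocr_command("image.png"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.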

License

Apache-2.0 (same as upstream model)
