GOT-OCR2: CrispEmbed GGUF
GGUF conversion of stepfun-ai/GOT-OCR2_0 for use with CrispEmbed.
Architecture
- Vision: SAM ViT-B (12 layers, 768d, 12 heads, 16×16 patches, 1024×1024 input)
- Windowed attention (ws=14) with global attention at layers [2, 5, 8, 11]
- Decomposed relative position encoding
- Neck: Conv(768→256) → LN2d → Conv(256→256) → LN2d
- Downsample: Conv(256→512→1024, stride 2) → 256 vision tokens
- Projector: Linear(1024, 1024)
- LLM: Qwen2-0.5B (24 layers, 1024d, MHA 16/16, SiLU SwiGLU, RoPE θ=1M)
- Tokenizer: tiktoken (151860 vocab)
- Total: ~0.7B parameters
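The token counts in the list above follow from simple shape arithmetic, sketched below. The image size, patch size, window size, and the final count of 256 vision tokens come from the list; the 3×3 kernel with padding 1 for the two stride-2 convolutions, the neck preserving the patch grid, and the padding of the grid to a multiple of the window size are assumptions, not values stated in this card.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


IMAGE = 1024   # input side length, from the architecture list
PATCH = 16     # patch side length, from the architecture list
WINDOW = 14    # attention window size (ws), from the architecture list

grid = IMAGE // PATCH        # patches per side
patch_tokens = grid * grid   # tokens processed inside the ViT

# Assumption: the patch grid is padded up to a multiple of the window size
# before it is split into attention windows.
padded = -(-grid // WINDOW) * WINDOW
windows = (padded // WINDOW) ** 2

# Assumption: the neck keeps the grid size, and both stride-2 convolutions
# use a 3x3 kernel with padding 1.
side = grid
for _ in range(2):
    side = conv_out(side, kernel=3, stride=2, padding=1)
vision_tokens = side * side

print(grid, patch_tokens, windows, vision_tokens)
```

Under these assumptions the 64×64 patch grid shrinks to 32×32 and then 16×16, which matches the 256 vision tokens listed above.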
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| got-ocr2-f16.gguf | F16 | 1.34 GB | Full precision |
| got-ocr2-q8_0.gguf | Q8_0 | 569 MB | Best quantized quality |
| got-ocr2-q4_k.gguf | Q4_K | 422 MB | Smallest, good quality |
Parity
All checkpoints were verified at cosine similarity ≥ 0.999 against the Python reference (F32):
- Vision layers (windowed + global attention)
- Neck, downsample, projector
- LLM decoder layers
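A parity check of this kind can be sketched as a cosine-similarity comparison between two flattened activation tensors. This is a minimal illustration, not the verification script used for this model: the function names and the sample vectors are hypothetical, and only the 0.999 threshold comes from the text above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_parity(candidate, reference, threshold=0.999):
    """True if the candidate output matches the reference closely enough."""
    return cosine_similarity(candidate, reference) >= threshold


# Hypothetical flattened activations: an F32 reference and a GGUF output.
reference = [0.12, -0.50, 0.33, 0.91]
candidate = [0.12, -0.49, 0.33, 0.92]

print(passes_parity(candidate, reference))
```

Cosine similarity ignores overall scale, so a check like this is usually paired with a look at absolute error when magnitudes matter.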
Usage
crispembed --ocr got-ocr2 image.png
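For batch use, the command above can be wrapped in a small script. The sketch below only uses the invocation shown in this section; the helper names are hypothetical, and it assumes that `crispembed` is on `PATH` and writes the recognized text to standard output, which this card does not state.

```python
import shutil
import subprocess


def build_ocr_command(image, model="got-ocr2", binary="crispembed"):
    """Build the argument list for the invocation shown above."""
    return [binary, "--ocr", model, str(image)]


def run_ocr(image):
    """Run OCR on one image and return the captured standard output.

    Returns None if the binary is not found on PATH.
    Assumption: the recognized text is printed to standard output.
    """
    cmd = build_ocr_command(image)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


print(" ".join(build_ocr_command("image.png")))
```

Calling `run_ocr(path)` for each file in a directory gives a simple batch loop; `check=True` raises `subprocess.CalledProcessError` if the tool exits with a non-zero status.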
License
Apache-2.0 (same as upstream model)