InternVL2.5-2B - CrispEmbed GGUF

GGUF conversions of OpenGVLab/InternVL2_5-2B for use with CrispEmbed.

Model Details

| Property | Value |
|---|---|
| Architecture | InternVL2.5 (InternViT-300M + InternLM2.5-1.8B) |
| Total Parameters | ~2.1B |
| Vision Encoder | InternViT-300M-448px (24L, 1024d, 16H, LayerNorm + GELU + LayerScale) |
| Projector | Pixel unshuffle (4:1) + LayerNorm + Linear + GELU + Linear |
| LLM Decoder | InternLM2.5-1.8B-chat (24L, 2048d, GQA 16/8, SwiGLU, RMSNorm) |
| Input Resolution | 448x448 per tile, dynamic tiling (1-12 tiles) |
| License | MIT |
| OCRBench | ~830 (top tier for <3B models) |
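
To illustrate what "dynamic tiling (1-12 tiles)" can mean in practice, here is a minimal sketch that picks a tile grid whose aspect ratio is closest to the image's. This is an assumption about the general approach, not CrispEmbed's actual preprocessing: the function name `choose_tile_grid` and the tie-breaking rule (prefer fewer tiles) are hypothetical.

```python
def choose_tile_grid(width, height, min_tiles=1, max_tiles=12):
    """Pick a (columns, rows) grid of 448x448 tiles for an image.

    Hypothetical helper: selects the grid whose aspect ratio is closest
    to the image's, breaking ties in favour of fewer tiles.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect = width / height
    # Every grid whose total tile count lies within the allowed range.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Sort key: aspect-ratio error first, then total tile count.
    return min(
        candidates,
        key=lambda grid: (abs(aspect - grid[0] / grid[1]), grid[0] * grid[1]),
    )


if __name__ == "__main__":
    for size in [(448, 448), (896, 448), (448, 1344)]:
        cols, rows = choose_tile_grid(*size)
        print(size, "->", cols, "x", rows, "=", cols * rows, "tiles")
```

Under this rule a square image maps to a single tile, while wide or tall images get more tiles along their longer side.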

Available Quantizations

| File | Size | Compression | Notes |
|---|---|---|---|
| internvl2.5-2b-f16.gguf | 4.9 GB | 1x | Full precision (F16 weights, F32 norms/embeds) |
| internvl2.5-2b-q8_0.gguf | 2.2 GB | 2.2x | Good quality, vision weights at Q8_0 floor |
| internvl2.5-2b-q4_k.gguf | 880 MB | 5.6x | Smallest, vision weights kept at Q8_0 minimum |

Note: Vision encoder weights are kept at Q8_0 minimum even in Q4_K to preserve OCR accuracy. The Q4_K savings come primarily from the LLM decoder.
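
The compression column is the F16 file size divided by the quantized file size. A small sketch of that arithmetic, using the sizes from the table above (the dictionary and function names are illustrative only):

```python
# File sizes in GB, taken from the quantization table (880 MB = 0.88 GB).
SIZES_GB = {"f16": 4.9, "q8_0": 2.2, "q4_k": 0.88}


def compression_ratios(sizes, baseline="f16"):
    """Return each file's compression ratio relative to the baseline file."""
    base = sizes[baseline]
    return {name: round(base / size, 1) for name, size in sizes.items()}


if __name__ == "__main__":
    for name, ratio in compression_ratios(SIZES_GB).items():
        print(f"{name}: {ratio}x")
```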

Usage with CrispEmbed

C:

#include <stdio.h>
#include "crispembed.h"

// Auto-detects InternVL2 architecture from GGUF metadata
void *ctx = crispembed_math_ocr_init("internvl2.5-2b-q4_k.gguf", 4);

// pixels, w, h, channels: image buffer and dimensions supplied by the caller
int len;
const char *text = crispembed_math_ocr_recognize(ctx, pixels, w, h, channels, &len);
printf("%s\n", text);

crispembed_math_ocr_free(ctx);

Python:

from crispembed import CrispMathOcr

ocr = CrispMathOcr("internvl2.5-2b-q4_k.gguf")
text = ocr.recognize("document.png")
print(text)

Parity Verification

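All components are verified against the Python reference implementation. The table reports the cosine similarity and maximum absolute difference of the activations at each stage: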

| Stage | cos_sim | max_abs_diff |
|---|---|---|
| vis_patch_embed | 1.000000 | 0.000003 |
| vis_layer_0..3 | 1.000000 | <0.001 |
| vis_proj_output | 1.000000 | 0.000909 |
| llm_embed | 1.000000 | 0.000000 |
| llm_layer_0..1 | 1.000000 | <0.000005 |
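
For readers who want to reproduce this kind of comparison, the two metrics can be computed as in the sketch below. It assumes the activations from both implementations have already been dumped to flat lists of floats. How CrispEmbed exports them is not covered here, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]      # e.g. Python reference activations
    candidate = [1.0, 2.0, 3.0005]   # e.g. C++ activations for the same stage
    print(f"cos_sim={cosine_similarity(reference, candidate):.6f}")
    print(f"max_abs_diff={max_abs_diff(reference, candidate):.6f}")
```

A cosine similarity that rounds to 1.000000 says the vectors point in the same direction. The maximum absolute difference catches isolated outliers that the cosine alone can hide.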

Conversion

Converted using models/convert-internvl2-to-gguf.py from CrispEmbed:

python models/convert-internvl2-to-gguf.py \
    --model OpenGVLab/InternVL2_5-2B \
    --output internvl2.5-2b-f16.gguf --dtype f16

# Then quantize with the C++ quantizer:
./crispembed-quantize internvl2.5-2b-f16.gguf internvl2.5-2b-q8_0.gguf q8_0
./crispembed-quantize internvl2.5-2b-f16.gguf internvl2.5-2b-q4_k.gguf q4_k
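
After converting or quantizing, a quick sanity check is to read the fixed GGUF header, which holds the magic bytes, format version, tensor count, and metadata key/value count. The sketch below assumes a little-endian GGUF v2 or v3 file (64-bit counts). The helper name `read_gguf_header` is illustrative and not part of CrispEmbed.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes little-endian GGUF v2/v3)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # uint32 version, then uint64 tensor count and uint64 metadata KV count.
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(read_gguf_header(sys.argv[1]))
```

Running it on each output file confirms that the quantized files are valid GGUF containers and report a plausible tensor count before loading them.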

Architecture

Image (448x448 per tile, 1-12 tiles)
  → Conv2D patch embed (14x14, stride 14) → 1024 patches
  → Prepend CLS + position embedding
  → 24x InternViT blocks (LayerNorm → MHSA → LayerScale → residual
                           LayerNorm → GELU MLP → LayerScale → residual)
  → Remove CLS → pixel unshuffle (4:1, 1024→256 tokens, dim 1024→4096)
  → LayerNorm → Linear(4096→2048) → GELU → Linear(2048→2048)
  → Splice into text token sequence
  → 24x InternLM2.5 blocks (RMSNorm → GQA(16/8) + RoPE → residual
                              RMSNorm → SwiGLU FFN → residual)
  → RMSNorm → LM head → logits → greedy decode
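
The token counts in the pipeline above follow from simple shape arithmetic, sketched below. The 4:1 pixel unshuffle is modelled as merging each 2x2 block of patches into one token, which is an assumption consistent with the 1024→256 token and 1024→4096 dimension figures. The function name is illustrative.

```python
def vision_token_shapes(image_size=448, patch_size=14, hidden=1024, merge=2):
    """Per-tile patch and token counts for the vision path."""
    side = image_size // patch_size        # 448 / 14 = 32 patches per side
    patches = side * side                  # 32 * 32 = 1024 patches
    tokens = (side // merge) ** 2          # 2x2 merge gives 16 * 16 = 256 tokens
    merged_dim = hidden * merge * merge    # 1024 * 4 = 4096 channels per token
    return {"patches": patches, "tokens": tokens, "merged_dim": merged_dim}


def image_tokens(num_tiles, tokens_per_tile=256):
    """Vision tokens spliced into the text sequence for a given tile count."""
    if not 1 <= num_tiles <= 12:
        raise ValueError("tile count must be between 1 and 12")
    return num_tiles * tokens_per_tile


if __name__ == "__main__":
    print(vision_token_shapes())
    print("12 tiles:", image_tokens(12), "vision tokens")
```

This makes the context cost of tiling explicit: each extra tile adds 256 tokens to the sequence the LLM decoder has to process.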

Credits
