TrOCR-small Printed Text - GGUF

Text recognition model for CrispEmbed. Recognizes printed text from cropped text-line images. Pair with a text detector like cstr/dbnet-ic15-GGUF for end-to-end OCR.

Architecture: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder (6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens). 61M parameters.

Source: microsoft/trocr-small-printed (MIT).

Model Variants

| Variant | Size | Recognition quality |
|---|---|---|
| F32 | 235 MB | exact match vs HuggingFace |
| F16 | 119 MB | exact same tokens |
| Q8_0 | 65 MB | exact same tokens |

Recommended: Q8_0 (65 MB). Q4_K is not provided: the 256-dim decoder bottleneck is too narrow for 4-bit quantization, causing recognition errors.
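The variant sizes listed above follow roughly from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope estimate, not the converter's logic: it assumes every weight is stored in the named type (real files also carry metadata and the tokenizer, and may keep some tensors at higher precision), and it uses the standard ggml Q8_0 layout of 32 int8 weights plus one fp16 scale per block. The function name `estimate_size_mb` is hypothetical.

```python
# Rough GGUF payload estimate from parameter count and storage type.
# Assumption: all weights use the named type; metadata is ignored.

BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    # ggml Q8_0 block: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
    "q8_0": 34 * 8 / 32,
}


def estimate_size_mb(n_params: float, dtype: str) -> float:
    """Return the estimated tensor payload in megabytes (10**6 bytes)."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8 / 1e6


if __name__ == "__main__":
    # 61M parameters, as stated in the architecture summary.
    for dtype in ("f32", "f16", "q8_0"):
        print(f"{dtype:5s} ~{estimate_size_mb(61e6, dtype):.0f} MB")
```

With 61M parameters the estimates land within a few percent of the listed file sizes; the remaining gap is plausibly down to MB versus MiB rounding and per-file overhead.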

Verification (all variants produce identical output)

| Input image | Output |
|---|---|
| "Hello World" | HELLO WORLD |
| "The quick brown fox" | THE QUICK BROWN FOX |
| "42 is the answer" | 42 IS THE ANSWER |

Note: trocr-small-printed uppercases its output (a training-data bias). For mixed-case output, use a trocr-base model.
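Because the output is uppercased, checking recognition results against mixed-case ground truth needs a case-insensitive comparison. A minimal sketch follows; the helper name `same_text_ignoring_case` is hypothetical and not part of CrispEmbed.

```python
def same_text_ignoring_case(predicted: str, reference: str) -> bool:
    """Compare OCR output with ground truth, ignoring case and extra whitespace."""

    def normalize(s: str) -> str:
        # casefold() handles case; split/join collapses runs of whitespace.
        return " ".join(s.casefold().split())

    return normalize(predicted) == normalize(reference)
```

For example, `same_text_ignoring_case("HELLO WORLD", "Hello World")` is true, while a genuine misrecognition still fails the comparison.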

Usage

Full OCR pipeline (with DBNet)

crispembed --det dbnet-ic15-q4_k.gguf \
    -m trocr-small-printed-q8_0.gguf \
    --ocr document.png

Output:

[ 0] (49,53)-(143,86)   conf=0.91  "HELLO"
[ 1] (153,52)-(270,86)  conf=0.91  "WORLD!"
[ 2] (50,122)-(124,157) conf=0.91  "THIS"
...
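If the CLI output needs to be consumed by another program, lines in the format shown above can be parsed with a regular expression. This is a sketch based only on the three sample lines: it assumes the two coordinate pairs are opposite corners of the box in pixels and that the quoted text contains no escaping. `OcrRegion` and `parse_ocr_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class OcrRegion(NamedTuple):
    index: int
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float
    text: str


# Matches lines such as: [ 0] (49,53)-(143,86)   conf=0.91  "HELLO"
LINE_RE = re.compile(
    r'^\[\s*(\d+)\]\s+\((\d+),(\d+)\)-\((\d+),(\d+)\)\s+conf=([0-9.]+)\s+"(.*)"\s*$'
)


def parse_ocr_line(line: str) -> Optional[OcrRegion]:
    """Parse one output line; return None for lines that do not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    idx, x0, y0, x1, y1 = (int(g) for g in m.groups()[:5])
    return OcrRegion(idx, x0, y0, x1, y1, float(m.group(6)), m.group(7))
```

Lines that do not match the pattern return `None`, so they can be skipped.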

C API

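#include <stdio.h>
#include "crispembed.h"

/* Load the detector and the recognizer. */
void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
                                 "trocr-small-printed-q8_0.gguf", 4);
int n = 0;  /* receives the number of recognized regions */
const crispembed_ocr_result *r = crispembed_ocr(ctx, "document.png", &n);
for (int i = 0; i < n; i++)
    printf("%s ", r[i].text);  /* print the text of each region */
crispembed_ocr_free(ctx);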

Pipeline size

| Detection | Recognition | Total | Latency |
|---|---|---|---|
| Q4_K (7 MB) | Q8_0 (65 MB) | 72 MB | ~200 ms/region |

Architecture

Input: text crop (resized to 384x384, grayscale)
  |
  +-> DeiT-small encoder (12 layers)
  |     16x16 patch embedding -> 576+2 tokens (CLS + distillation)
  |     12x Pre-LN MHA (6 heads, 384d) + FFN (GELU, 1536d)
  |
  +-> TrOCR decoder (6 layers, autoregressive)
        Token + position embedding (64044 BPE vocab, 514 max positions)
        6x Self-attn (causal) + Cross-attn + FFN
        -> greedy argmax -> SentencePiece BPE detokenize
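Two steps in the diagram above can be illustrated in a few lines: how the encoder sequence length of 576+2 arises, and what greedy argmax decoding does. This is a sketch, not CrispEmbed's implementation: `step` is a hypothetical stand-in for one decoder forward pass, and the BOS/EOS ids are passed in by the caller rather than assumed.

```python
from typing import Callable, List, Sequence


def encoder_token_count(image_size: int = 384, patch_size: int = 16,
                        extra_tokens: int = 2) -> int:
    """Patches per side squared, plus the CLS and distillation tokens."""
    per_side = image_size // patch_size      # 384 / 16 = 24
    return per_side * per_side + extra_tokens


def greedy_decode(step: Callable[[List[int]], Sequence[float]],
                  bos_id: int, eos_id: int, max_len: int = 514) -> List[int]:
    """Append the highest-scoring token until EOS or max_len is reached.

    `step` (hypothetical) takes the tokens generated so far and returns
    logits over the vocabulary for the next position.
    """
    tokens = [bos_id]
    while len(tokens) < max_len:
        logits = step(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Greedy decoding needs one decoder pass per output token, so recognition time grows with the length of the text line.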

XLM-R SentencePiece tokenizer with fairseq vocab offset. Word boundaries are marked by ▁ (U+2581) and converted to spaces at decode time.
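The detokenization step can be sketched as follows. The id layout is an assumption borrowed from the Hugging Face XLM-R tokenizer (four fairseq special tokens at ids 0 to 3, and every other id equal to the SentencePiece id plus 1); CrispEmbed's own code may differ. The toy piece list and the function names are hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece word-boundary marker

# Assumed fairseq layout, as in the Hugging Face XLM-R tokenizer.
FAIRSEQ_SPECIALS = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>"}
FAIRSEQ_OFFSET = 1


def detokenize(pieces):
    """Join SentencePiece pieces and turn boundary markers into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


def ids_to_text(ids, spm_pieces):
    """Map model ids to pieces (undoing the offset), then detokenize.

    `spm_pieces` is a list indexed by SentencePiece id. Special tokens
    are dropped from the output in this sketch.
    """
    pieces = [spm_pieces[i - FAIRSEQ_OFFSET]
              for i in ids if i not in FAIRSEQ_SPECIALS]
    return detokenize(pieces)


if __name__ == "__main__":
    # Toy vocabulary: SentencePiece ids 0-2 are its own special tokens.
    toy = ["<unk>", "<s>", "</s>", "\u2581HELLO", "\u2581WORLD", "!"]
    print(ids_to_text([0, 4, 5, 6, 2], toy))
```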

Conversion

pip install gguf numpy transformers sentencepiece safetensors

# Download model
python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"

# Convert (embeds XLM-R tokenizer via AutoTokenizer)
python models/convert-trocr-to-gguf.py \
    --model-dir trocr-small-printed/ \
    --output trocr-small-printed-f32.gguf

# Quantize (Q8_0 recommended; Q4_K degrades this model)
crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
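After converting or quantizing, a quick sanity check is to read the GGUF header of the output file. The sketch below relies on the GGUF layout for version 2 and later (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count). It assumes the converter writes such a file, and the function name `read_gguf_header` is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes GGUF v2 or later, little-endian, where both counts are uint64.
    """
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return version, n_tensors, n_kv


if __name__ == "__main__":
    import sys
    version, n_tensors, n_kv = read_gguf_header(sys.argv[1])
    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```

Running it on the F32 and Q8_0 files should report the same tensor count, since quantization changes how tensors are stored but not how many there are.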

License

MIT (same as microsoft/trocr-small-printed).
