TrOCR-small Printed Text — GGUF

Text recognition model for CrispEmbed. Recognizes printed text from cropped text-line images. Pair with a text detector like cstr/dbnet-ic15-GGUF for end-to-end OCR.

Architecture: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder (6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens). 61M parameters.

Source: microsoft/trocr-small-printed (MIT).

Model Variants

Variant	Size	Recognition quality
F32	235 MB	output matches the HuggingFace reference exactly
F16	119 MB	same tokens as F32
Q8_0	65 MB	same tokens as F32

Recommended: Q8_0 (65 MB). Q4_K is not provided: the 256-dim decoder is too narrow to tolerate 4-bit quantization, which causes recognition errors.
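As a rough cross-check of the sizes in the table, the tensor payload can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch, not how the files were measured: it assumes every tensor uses the same type and ignores metadata and the embedded tokenizer, so the results only approximate the listed sizes. The function name `estimated_size_mb` is hypothetical.

```python
# Rough size estimate per variant. Real files also carry metadata and the
# embedded tokenizer, and some tensors may stay at higher precision.
PARAMS = 61_000_000  # "61M parameters" as stated for this model

# Storage cost in bits per weight for each GGML tensor type.
# Q8_0 packs 32 int8 weights plus one fp16 scale: (32*8 + 16) / 32 = 8.5 bits.
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5}


def estimated_size_mb(params: int, bits_per_weight: float) -> float:
    """Return the tensor payload size in megabytes (10^6 bytes)."""
    return params * bits_per_weight / 8 / 1e6


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_mb(PARAMS, bits):.0f} MB")
```

The estimates land within a few percent of the table. The remaining gap may come from unit conventions (MB vs MiB), file metadata, or tensors kept at a different precision.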

Verification (all variants produce identical output)

Input image	Output
"Hello World"	HELLO WORLD
"The quick brown fox"	THE QUICK BROWN FOX
"42 is the answer"	42 IS THE ANSWER

Note: trocr-small-printed emits all-uppercase text regardless of the input's case (a training data bias). For mixed-case output, use a trocr-base model.
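Because the output is uppercased, comparing it against mixed-case ground truth needs a case-insensitive match. A minimal sketch, using the three pairs from the verification table; the helper name `same_text_ignoring_case` is hypothetical and not part of CrispEmbed.

```python
def same_text_ignoring_case(expected: str, recognized: str) -> bool:
    """Compare ground truth and OCR output without regard to letter case."""
    return expected.casefold() == recognized.casefold()


# Pairs taken from the verification table.
samples = [
    ("Hello World", "HELLO WORLD"),
    ("The quick brown fox", "THE QUICK BROWN FOX"),
    ("42 is the answer", "42 IS THE ANSWER"),
]
assert all(same_text_ignoring_case(e, r) for e, r in samples)
```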

Usage

Full OCR pipeline (with DBNet)

crispembed --det dbnet-ic15-q4_k.gguf \
    -m trocr-small-printed-q8_0.gguf \
    --ocr document.png

Output:

[ 0] (49,53)-(143,86)   conf=0.91  "HELLO"
[ 1] (153,52)-(270,86)  conf=0.91  "WORLD!"
[ 2] (50,122)-(124,157) conf=0.91  "THIS"
...
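For post-processing, the output lines can be parsed into structured records. The sketch below infers the line format from the three sample lines only, so it may need adjusting if the real output varies (for example, escaped quotes inside the text). `OcrRegion` and `parse_ocr_line` are hypothetical names, not part of CrispEmbed.

```python
import re
from typing import NamedTuple, Optional


class OcrRegion(NamedTuple):
    index: int
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float
    text: str


# [ idx] (x0,y0)-(x1,y1)   conf=0.91  "TEXT"
LINE_RE = re.compile(
    r'^\[\s*(\d+)\]\s+\((\d+),(\d+)\)-\((\d+),(\d+)\)'
    r'\s+conf=([0-9.]+)\s+"(.*)"\s*$'
)


def parse_ocr_line(line: str) -> Optional[OcrRegion]:
    """Parse one output line; return None if it does not match the format."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    idx, x0, y0, x1, y1 = (int(g) for g in m.groups()[:5])
    return OcrRegion(idx, x0, y0, x1, y1, float(m.group(6)), m.group(7))


region = parse_ocr_line('[ 1] (153,52)-(270,86)  conf=0.91  "WORLD!"')
print(region)
```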

C API

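#include <stdio.h>
#include "crispembed.h"

int main(void) {
    /* Load the detector and recognizer models. */
    void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
                                     "trocr-small-printed-q8_0.gguf", 4);
    if (!ctx) return 1;  /* assumes NULL signals a load failure */
    int n = 0;
    const crispembed_ocr_result *r = crispembed_ocr(ctx, "document.png", &n);
    /* Print the recognized text of each detected region. */
    for (int i = 0; r && i < n; i++)
        printf("%s ", r[i].text);
    printf("\n");
    crispembed_ocr_free(ctx);
    return 0;
}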

Pipeline size

Detection	Recognition	Total size	Latency
Q4_K (7 MB)	Q8_0 (65 MB)	72 MB	~200 ms/region

Architecture

Input: text crop (resized to 384x384, grayscale)
  |
  +-> DeiT-small encoder (12 layers)
  |     16x16 patch embedding -> 576+2 tokens (CLS + distillation)
  |     12x Pre-LN MHA (6 heads, 384d) + FFN (GELU, 1536d)
  |
  +-> TrOCR decoder (6 layers, autoregressive)
        Token + position embedding (64044 BPE vocab, 514 max positions)
        6x Self-attn (causal) + Cross-attn + FFN
        -> greedy argmax -> SentencePiece BPE detokenize
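The sequence length and head sizes in the diagram follow directly from the stated dimensions. A short arithmetic sketch, using only numbers given in this document; the variable names are illustrative.

```python
IMAGE_SIZE = 384   # input is resized to 384x384
PATCH_SIZE = 16    # 16x16 patch embedding
EXTRA_TOKENS = 2   # CLS + distillation tokens

patches_per_side = IMAGE_SIZE // PATCH_SIZE        # 24
num_patches = patches_per_side ** 2                # 576
encoder_seq_len = num_patches + EXTRA_TOKENS       # 578

# Per-head width = model width / number of heads.
encoder_head_dim = 384 // 6                        # 64
decoder_head_dim = 256 // 8                        # 32

print(encoder_seq_len, encoder_head_dim, decoder_head_dim)
```

Every cross-attention step in the decoder therefore attends over 578 encoder positions, independent of how much text the crop contains.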

XLM-R SentencePiece tokenizer with fairseq vocab offset. Word boundaries marked by ▁ (U+2581), converted to spaces at decode time.
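The last two stages of the diagram, greedy decoding and detokenization, can be sketched as follows. This is an illustration under stated assumptions, not CrispEmbed's implementation: the vocabulary, token ids, and the scripted `toy_next_logits` stand in for the real decoder, and the fairseq id offset is not modeled.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_logits: Callable[[List[int]], Sequence[float]],
    start_id: int,
    eos_id: int,
    max_len: int,
) -> List[int]:
    """Repeatedly append the highest-scoring token until EOS or max_len."""
    tokens = [start_id]
    while len(tokens) < max_len:
        logits = next_logits(tokens)
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if best == eos_id:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


def detokenize(pieces: Sequence[str]) -> str:
    """Join SentencePiece pieces and turn U+2581 markers into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()


# Toy vocabulary and a scripted "model", both hypothetical.
VOCAB = ["<s>", "</s>", "\u2581HELLO", "\u2581WOR", "LD"]
SCRIPT = [2, 3, 4, 1]  # token favored at each step, ending with </s>


def toy_next_logits(tokens: List[int]) -> List[float]:
    target = SCRIPT[len(tokens) - 1]
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]


ids = greedy_decode(toy_next_logits, start_id=0, eos_id=1, max_len=16)
text = detokenize([VOCAB[i] for i in ids])
print(text)
```

In the real model, `max_len` would be bounded by the 514 decoder positions, and `next_logits` would run the six decoder layers against the encoder output.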

Conversion

pip install gguf numpy transformers sentencepiece safetensors huggingface_hub

# Download model
python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"

# Convert (embeds XLM-R tokenizer via AutoTokenizer)
python models/convert-trocr-to-gguf.py \
    --model-dir trocr-small-printed/ \
    --output trocr-small-printed-f32.gguf

# Quantize (Q8_0 recommended; Q4_K degrades this model)
crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
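After converting or quantizing, a quick sanity check is to confirm the file starts with a valid GGUF header: the four magic bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below checks only that header and says nothing about tensor contents. `read_gguf_version` is a hypothetical helper, and the self-check runs on a synthetic 8-byte file rather than a real model.

```python
import os
import struct
import tempfile


def read_gguf_version(path: str) -> int:
    """Return the GGUF version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Self-check on a synthetic 8-byte header (not a real model file).
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
demo_version = read_gguf_version(tmp.name)
os.remove(tmp.name)
print(demo_version)
```

On a real file the call would be `read_gguf_version("trocr-small-printed-q8_0.gguf")`.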

License

MIT (same as microsoft/trocr-small-printed).
