PosFormer Handwritten Math OCR — CROHME-trained GGUF

PosFormer (Position Forest Transformer) for handwritten mathematical expression recognition, retrained from scratch on CROHME 2014 plus a 2,000-image MathWriting subset and converted to GGUF format for CrispEmbed.

License

CC BY-NC-SA 3.0 — inherited from the CROHME 2014 training data.

Personal, educational, and research use: allowed
Apps (including commercial): allowed — the app is a separate work; the NC clause applies to the weights, not to software that loads them. Users download the weights separately and accept the NC terms.
Redistributing or selling the weights themselves: not allowed
Attribution required: cite CROHME and PosFormer (see below)
ShareAlike: derivative models must use the same or compatible license

The C++ inference engine (CrispEmbed) and GGUF converter are original clean-room implementations (MIT license).

Model details

Property	Value
Architecture	DenseNet encoder + 3-layer Transformer decoder + ARM
Parameters	6.5M
Training data	CROHME 2014 (8,835 images) + MathWriting (2,000 images)
Training data license	CC BY-NC-SA 3.0 (CROHME) + CC BY-NC-SA 4.0 (MathWriting)
Training	300 epochs, SGD+momentum, cosine annealing warm restarts, label smoothing 0.1
Vocabulary	113 LaTeX tokens (canonical PosFormer dictionary)
Input	Grayscale handwritten math image
Output	LaTeX token sequence

Training details

Retrained from scratch (no transfer learning from published weights) using the PosFormer architecture on the CROHME 2014 train set. Key differences from the published training:

Cosine annealing with warm restarts (T_0=30, T_mult=2) instead of ReduceLROnPlateau
Label smoothing (ε=0.1) on cross-entropy loss
Greedy validation (beam_size=1) for faster training epochs
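
To make the first two items concrete, here is a minimal standard-library sketch of a cosine schedule with warm restarts and of uniformly label-smoothed cross-entropy. It is not the training code: `T_0=30`, `T_mult=2` and `ε=0.1` come from the list above, while `base_lr`, `eta_min` and all function names are assumptions made for illustration.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr, t_0=30, t_mult=2, eta_min=0.0):
    """Learning rate at an integer epoch under cosine annealing with warm restarts.

    t_0 and t_mult follow the values quoted in the text; base_lr and eta_min
    are placeholders, because the text does not state them for this schedule.
    """
    t_i, t_cur = t_0, epoch
    # Walk through cycles of length t_0, t_0*t_mult, t_0*t_mult**2, ...
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


def restart_epochs(total_epochs, t_0=30, t_mult=2):
    """Epochs at which the schedule jumps back to base_lr."""
    restarts, start, length = [], 0, t_0
    while start + length < total_epochs:
        start += length
        restarts.append(start)
        length *= t_mult
    return restarts


def smoothed_cross_entropy(log_probs, target, eps=0.1):
    """Cross-entropy for one decoding step with uniform label smoothing.

    log_probs holds one log-probability per vocabulary entry. The smoothed
    target keeps (1 - eps) on the true class and spreads eps uniformly
    over all classes.
    """
    uniform_term = -sum(log_probs) / len(log_probs)
    return (1.0 - eps) * (-log_probs[target]) + eps * uniform_term


# With a 300-epoch run the learning rate restarts three times.
print(restart_epochs(300))  # [30, 90, 210]
```

Under these settings the last cycle (starting at epoch 210) would be 240 epochs long, so a 300-epoch run ends partway through it.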

Training was monitored via Weights & Biases.

Files

File	Quant	Size	Notes
`posformer-crohme-f32.gguf`	F32	24.9 MB	Full precision
`posformer-crohme-q8_0.gguf`	Q8_0	12 MB	Recommended for mobile
`posformer-crohme-q4_k.gguf`	Q4_K	10 MB	Smallest, lossless on test
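
As a quick sanity check on a downloaded file, the fixed-size GGUF header can be read with the standard library. This sketch assumes the GGUF v2-or-later layout (4-byte magic `GGUF`, little-endian `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). The function name is hypothetical and is not part of CrispEmbed.

```python
import struct


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header (v2+ layout assumed)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    # "<" = little-endian without padding: 4s magic, I version, Q tensors, Q kv pairs
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


# Usage (file name taken from the table above):
# with open("posformer-crohme-q8_0.gguf", "rb") as f:
#     print(read_gguf_header(f.read(24)))
```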

Accuracy (CROHME 2014 test set, 986 images)

Greedy left-to-right decoding (beam_size=1, no bi-directional search):

Model	Epoch	Raw match	Notes
This model (F32)	182	60.5%	Retrained, CC BY-NC-SA 3.0/4.0
SJTU published weights	206	56.0%	Academic-only license
BTTR baseline	—	49.2%	MIT license
HMER baseline	—	36.1%	MIT license

The model was trained on CROHME 2014 (8,835 images) plus 2,000 MathWriting samples (filtered to the 110-token vocab). With beam_size=1 it reaches 60.5% exact match on the full CROHME 2014 test set (986 images), 4.5 points above the published SJTU weights under the same greedy decoding (56.0%). val_ExpRate peaked at 62.0% during training. LR=0.00125 after the ReduceLROnPlateau drop.
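
For readers reproducing the "raw match" column, the metric can be sketched as below. It assumes predictions and references are whitespace-separated LaTeX token strings and that a sample counts only when the two token sequences are identical, with no normalization. The function name is illustrative, not taken from the evaluation scripts.

```python
def exact_match_rate(predictions, references):
    """Fraction of samples whose predicted token sequence equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Compare token lists so differences in spacing alone do not count as errors.
    hits = sum(1 for p, r in zip(predictions, references) if p.split() == r.split())
    return hits / len(references)


preds = ["x ^ { 2 }", "a + b", "\\frac { 1 } { 2 }"]
refs = ["x ^ { 2 }", "a - b", "\\frac { 1 } { 2 }"]
print(f"{exact_match_rate(preds, refs):.1%}")  # 66.7%
```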

Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (beam_size=10), so it is not directly comparable to the greedy numbers above. The greedy results are the relevant reference for the C++ inference engine, which uses greedy L2R decoding.
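
Greedy left-to-right decoding, as opposed to beam search, keeps only the single best token at each step. The loop below is a schematic sketch of that idea. `step_fn`, `sos_id`, `eos_id` and `max_len` are hypothetical stand-ins for the real decoder interface, which the text does not describe.

```python
from typing import Callable, List, Sequence


def greedy_decode(step_fn: Callable[[List[int]], Sequence[float]],
                  sos_id: int, eos_id: int, max_len: int = 200) -> List[int]:
    """Left-to-right greedy decoding: always take the top-scoring next token."""
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start-of-sequence token


def toy_step(prefix: List[int]) -> List[float]:
    """Toy stand-in for a decoder: emits token 3, then 4, then end-of-sequence (1)."""
    script = [3, 4, 1]
    want = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == want else 0.0 for i in range(5)]


print(greedy_decode(toy_step, sos_id=0, eos_id=1))  # [3, 4]
```

Beam search with `beam_size=10` would instead keep ten partial hypotheses per step, which is why its scores are reported separately.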

Usage with CrispEmbed

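# Build (out-of-source; create the build directory if it does not exist yet)
mkdir -p CrispEmbed-build && cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer

# Run (point the loader at the ggml libraries in the build tree)
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-crohme-q8_0.gguf image.bmp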

Parity

The C++ inference engine's outputs match the PyTorch reference to >99.999% (cosine similarity = 1.000000 at every decoder step). See tests/parity/posformer_*.py in the CrispEmbed repo for the verification scripts.
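
The kind of per-step comparison described above can be sketched as follows. This is not the repository's parity script: the function names and the assumption that each decoder step is available as a flat list of floats are illustrative, and the threshold simply mirrors the >99.999% figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def check_parity(ref_steps, test_steps, threshold=0.99999):
    """Compare outputs step by step; return (similarities, all_above_threshold)."""
    if len(ref_steps) != len(test_steps):
        raise ValueError("both runs must have the same number of decoder steps")
    sims = [cosine_similarity(r, t) for r, t in zip(ref_steps, test_steps)]
    return sims, all(s > threshold for s in sims)


# Toy data standing in for PyTorch and C++ decoder outputs.
reference = [[0.1, 2.0, -1.5], [3.0, 0.2, 0.7]]
candidate = [[0.1, 2.0, -1.5], [3.0, 0.2, 0.7]]
sims, ok = check_parity(reference, candidate)
print(ok)  # True
```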

Citation

@inproceedings{guan2024posformer,
  title={PosFormer: Recognizing Complex Handwritten Mathematical Expression
         with Position Forest Transformer},
  author={Guan, Tongkun and Lin, Chengyu and Shen, Wei and Yang, Xiaokang},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
