PosFormer Handwritten Math OCR: CROHME-trained GGUF

PosFormer (Position Forest Transformer) for handwritten mathematical expression recognition, retrained from scratch on CROHME 2014 plus a MathWriting subset and converted to GGUF format for CrispEmbed.

License

CC BY-NC-SA 3.0, inherited from the CROHME 2014 training data.

  • Personal, educational, and research use: allowed
  • Apps (including commercial): allowed. The app is a separate work; the NC clause applies to the weights, not to software that loads them. Users download the weights separately and accept the NC terms.
  • Redistributing or selling the weights themselves: not allowed
  • Attribution required: cite CROHME and PosFormer (see below)
  • ShareAlike: derivative models must use the same or compatible license

The C++ inference engine (CrispEmbed) and GGUF converter are original clean-room implementations (MIT license).

Model details

| Property | Value |
|---|---|
| Architecture | DenseNet encoder + 3-layer Transformer decoder + ARM |
| Parameters | 6.5M |
| Training data | CROHME 2014 (8,835 images) + MathWriting (2,000 images) |
| Training data license | CC BY-NC-SA 3.0 (CROHME) + CC BY-NC-SA 4.0 (MathWriting) |
| Training | 300 epochs, SGD+momentum, cosine annealing warm restarts, label smoothing 0.1 |
| Vocabulary | 113 LaTeX tokens (canonical PosFormer dictionary) |
| Input | Grayscale handwritten math image |
| Output | LaTeX token sequence |

Training details

Retrained from scratch (no transfer learning from published weights) using the PosFormer architecture on the CROHME 2014 training set. Key differences from the published training recipe:

  • Cosine annealing with warm restarts (T_0=30, T_mult=2) instead of ReduceLROnPlateau
  • Label smoothing (ε=0.1) on cross-entropy loss
  • Greedy validation (beam_size=1) for faster training epochs
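The first two items can be sketched in plain Python. This is a minimal illustration, not the training code: it assumes the schedule follows the usual SGDR formula (the one PyTorch's `CosineAnnealingWarmRestarts` implements) and that label smoothing mixes the one-hot target with a uniform distribution over classes (as PyTorch's `label_smoothing` argument does). The base learning rate of 0.08 and `eta_min=0.0` are hypothetical values chosen for the example; the source does not state them.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr, t_0=30, t_mult=2, eta_min=0.0):
    """Learning rate at a 0-based epoch under cosine annealing with warm restarts."""
    t_i = t_0      # length of the current cycle
    t_cur = epoch  # epochs elapsed since the most recent restart
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult  # each cycle is t_mult times longer than the previous one
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def label_smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against the target (1 - eps) * one_hot + eps / K * uniform."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))  # stable log-sum-exp
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                     # ordinary cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # cross-entropy against the uniform part
    return (1.0 - eps) * nll + eps * uniform


base_lr = 0.08  # hypothetical base learning rate, not taken from the model card
for epoch in (0, 29, 30, 89, 90, 182):
    print(epoch, round(cosine_warm_restart_lr(epoch, base_lr), 5))
```

With T_0=30 and T_mult=2 the cycles last 30, 60, and 120 epochs, so under these assumptions the learning rate jumps back to its base value at epochs 30, 90, and 210 of the 300-epoch run.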

Training monitored via Weights & Biases.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| posformer-crohme-f32.gguf | F32 | 24.9 MB | Full precision |
| posformer-crohme-q8_0.gguf | Q8_0 | 12 MB | Recommended for mobile |
| posformer-crohme-q4_k.gguf | Q4_K | 10 MB | Smallest; no accuracy loss on the test set |
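A downloaded file can be sanity-checked before it is handed to the inference engine. The sketch below reads only the fixed-size GGUF header and assumes GGUF version 2 or later, where the tensor and metadata counts are 64-bit little-endian integers. The function name `read_gguf_header` is illustrative and is not part of CrispEmbed.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + uint64 tensor count + uint64 KV count


def read_gguf_header(path):
    """Return version, tensor count and metadata key/value count of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<" selects little-endian layout without padding: I = uint32, Q = uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}


# Example call (the file must exist locally):
# print(read_gguf_header("posformer-crohme-q8_0.gguf"))
```

A truncated or mislabeled download fails the magic check immediately, which is cheaper than discovering the problem while the model loads.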

Accuracy (CROHME 2014 test set, 986 images)

Greedy left-to-right decoding (beam_size=1, no bi-directional search):

| Model | Epoch | Raw match | Notes |
|---|---|---|---|
| This model (F32) | 182 | 60.5% | Retrained, CC BY-NC-SA 3.0/4.0 |
| SJTU published weights | 206 | 56.0% | Academic-only license |
| BTTR baseline | - | 49.2% | MIT license |
| HMER baseline | - | 36.1% | MIT license |

Trained on CROHME 2014 (8,835 images) plus 2,000 MathWriting samples (filtered to the 110-token vocab). With greedy decoding (beam_size=1), the model reaches 60.5% exact match on the full CROHME 2014 test set (986 images), 4.5 points above the published SJTU weights under the same greedy decoding (56.0%). val_ExpRate peaked at 62.0% during training. LR=0.00125 after ReduceLROnPlateau drop.
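The exact-match figures count a prediction as correct only when the whole expression matches the reference. A minimal sketch of that metric follows. It assumes predictions and references are LaTeX strings with space-separated tokens and that comparison is done token by token; the evaluation script used for the table may normalize differently.

```python
def exact_match_rate(predictions, references):
    """Fraction of expressions whose token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one expression to score")
    hits = sum(
        pred.split() == ref.split()  # compare whitespace-separated LaTeX tokens
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Toy data for illustration only, not CROHME samples.
preds = ["x ^ { 2 } + 1", "\\frac { a } { b }", "y = 2"]
refs = ["x ^ { 2 } + 1", "\\frac { a } { c }", "y  =  2"]
print(f"{exact_match_rate(preds, refs):.1%}")
```

A single wrong token makes the whole expression count as a miss, which is why this metric is much stricter than token-level accuracy.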

Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (beam_size=10), so it is not directly comparable to the greedy numbers above. Our greedy results are directly comparable to our C++ inference engine, which uses greedy left-to-right (L2R) decoding.
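For readers unfamiliar with the distinction, greedy left-to-right decoding keeps a single hypothesis and always appends the highest-scoring next token. The sketch below shows that loop in isolation. `step_fn`, the token ids, and `max_len` are hypothetical stand-ins for the real decoder and vocabulary; this is not CrispEmbed's API.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=200):
    """Greedy L2R decoding: append the argmax token until EOS or max_len.

    step_fn(tokens) is a hypothetical callable that returns one score per
    vocabulary entry for the next position, given the tokens decoded so far.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start-of-sequence marker


def toy_step(tokens):
    """Stand-in decoder over a 5-token vocabulary: emits 3, 4, then EOS (id 1)."""
    script = {1: 3, 2: 4}              # number of tokens so far -> next token id
    target = script.get(len(tokens), 1)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_step, sos_id=0, eos_id=1))
```

Beam search with beam_size=10 would instead track ten partial sequences per step, and the bi-directional variant also decodes right to left, which explains the gap between greedy and published scores.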

Usage with CrispEmbed

# Build
cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer

# Run
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-crohme-q8_0.gguf image.bmp

Parity

The C++ inference engine matches PyTorch to >99.999% (cosine similarity = 1.000000 at every decoder step). See tests/parity/posformer_*.py in the CrispEmbed repo for verification scripts.
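The kind of check behind this claim can be sketched as follows. It assumes each decoder step's output from both implementations is available as a flat list of floats (for example, logits dumped to text); the function names and the 0.99999 threshold mirror the figure quoted above but are illustrative, and the actual scripts in the repo may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def steps_match(reference_steps, candidate_steps, threshold=0.99999):
    """True if every decoder step's similarity exceeds the threshold."""
    if len(reference_steps) != len(candidate_steps):
        return False  # a different number of decoded steps is already a mismatch
    return all(
        cosine_similarity(ref, cand) > threshold
        for ref, cand in zip(reference_steps, candidate_steps)
    )


# Toy vectors for illustration; real inputs would be per-step decoder outputs.
pytorch_steps = [[0.1, 2.0, -1.5], [3.0, 0.2, 0.1]]
cpp_steps = [[0.1000001, 2.0, -1.5], [3.0, 0.2000001, 0.1]]
print(steps_match(pytorch_steps, cpp_steps))
```

Cosine similarity ignores overall scale, so a stricter comparison would also check the maximum absolute difference per step.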

Citation

@inproceedings{guan2024posformer,
  title={PosFormer: Recognizing Complex Handwritten Mathematical Expression
         with Position Forest Transformer},
  author={Guan, Tongkun and Lin, Chengyu and Shen, Wei and Yang, Xiaokang},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
