PosFormer Handwritten Math OCR: CROHME-trained GGUF
PosFormer (Position-aware Transformer) for handwritten mathematical expression recognition, retrained from scratch on CROHME 2014 and converted to GGUF format for CrispEmbed.
License
CC BY-NC-SA 3.0, inherited from the CROHME 2014 training data.
- Personal, educational, and research use: allowed
- Apps (including commercial): allowed. The app is a separate work; the NC clause applies to the weights, not to software that loads them. Users download the weights separately and accept the NC terms.
- Redistributing or selling the weights themselves: not allowed
- Attribution required: cite CROHME and PosFormer (see below)
- ShareAlike: derivative models must use the same or compatible license
The C++ inference engine (CrispEmbed) and GGUF converter are original clean-room implementations (MIT license).
Model details
| Property | Value |
|---|---|
| Architecture | DenseNet encoder + 3-layer Transformer decoder + ARM |
| Parameters | 6.5M |
| Training data | CROHME 2014 (8,835 images) + MathWriting (2,000 images) |
| Training data license | CC BY-NC-SA 3.0 (CROHME) + CC BY-NC-SA 4.0 (MathWriting) |
| Training | 300 epochs, SGD+momentum, cosine annealing warm restarts, label smoothing 0.1 |
| Vocabulary | 113 LaTeX tokens (canonical PosFormer dictionary) |
| Input | Grayscale handwritten math image |
| Output | LaTeX token sequence |
Training details
Retrained from scratch (no transfer learning from published weights) using the PosFormer architecture on the CROHME 2014 train set. Key differences from the published training:
- Cosine annealing with warm restarts (T_0=30, T_mult=2) instead of ReduceLROnPlateau
- Label smoothing (ε=0.1) on cross-entropy loss
- Greedy validation (beam_size=1) for faster training epochs
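As a rough illustration of the first two items, the sketch below computes a cosine-annealing-with-warm-restarts schedule and a label-smoothed target in plain Python. It assumes the standard SGDR formulation stepped once per epoch and the usual "mix with a uniform distribution" definition of label smoothing; the function names, `eta_min=0.0`, and the use of `base_lr=1.0` as a placeholder peak rate are assumptions for illustration, not values taken from the training code.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr, t_0=30, t_mult=2, eta_min=0.0):
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    t_i = t_0      # length of the current cycle
    t_cur = epoch  # epochs elapsed since the last restart
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def restart_epochs(total_epochs, t_0=30, t_mult=2):
    """Epochs at which the schedule jumps back to base_lr."""
    restarts = []
    boundary, length = t_0, t_0
    while boundary < total_epochs:
        restarts.append(boundary)
        length *= t_mult
        boundary += length
    return restarts


def smoothed_target(true_index, num_classes, epsilon=0.1):
    """Target distribution: (1 - epsilon) on the true class plus epsilon spread uniformly."""
    target = [epsilon / num_classes] * num_classes
    target[true_index] += 1.0 - epsilon
    return target


if __name__ == "__main__":
    # base_lr=1.0 is a placeholder, so the values read as fractions of the peak rate.
    print(restart_epochs(300))  # [30, 90, 210]
    for epoch in (0, 15, 29, 30):
        print(epoch, round(cosine_warm_restart_lr(epoch, base_lr=1.0), 4))
    # 113 classes matches the vocabulary size listed in the model details table.
    print(round(smoothed_target(0, 113)[0], 6))
```

Under these assumptions, T_0=30 and T_mult=2 give cycles of 30, 60, and 120 epochs, so a 300-epoch run restarts at epochs 30, 90, and 210.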
Training monitored via Weights & Biases.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| posformer-crohme-f32.gguf | F32 | 24.9 MB | Full precision |
| posformer-crohme-q8_0.gguf | Q8_0 | 12 MB | Recommended for mobile |
| posformer-crohme-q4_k.gguf | Q4_K | 10 MB | Smallest, lossless on test |
Accuracy (CROHME 2014 test set, 986 images)
Greedy left-to-right decoding (beam_size=1, no bi-directional search):
| Model | Epoch | Raw match | Notes |
|---|---|---|---|
| This model (F32) | 182 | 60.5% | Retrained, CC BY-NC-SA 3.0/4.0 |
| SJTU published weights | 206 | 56.0% | Academic-only license |
| BTTR baseline | - | 49.2% | MIT license |
| HMER baseline | - | 36.1% | MIT license |
This model was trained on CROHME 2014 (8,835 images) plus 2,000 MathWriting samples (filtered to 110-token vocab). With greedy decoding (beam_size=1) it reaches 60.5% exact match on the full CROHME 2014 test set (986 images); val_ExpRate peaked at 62.0% during training. That is 4.5 points above the published SJTU weights under the same greedy decoding (56.0%). LR=0.00125 after ReduceLROnPlateau drop.
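For readers unfamiliar with the metric, here is a minimal sketch of how an exact-match rate can be computed over token sequences. It assumes "raw match" means the predicted LaTeX token list equals the reference token list with no normalization; the function name and the toy samples are hypothetical and not taken from the evaluation scripts.

```python
def exact_match_rate(predictions, references):
    """Fraction of samples whose predicted token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(1 for pred, ref in zip(predictions, references) if list(pred) == list(ref))
    return hits / len(references)


if __name__ == "__main__":
    # Toy data: two of the three predictions match their reference token for token.
    refs = [["x", "^", "2"], ["\\frac", "{", "a", "}", "{", "b", "}"], ["a", "+", "b"]]
    preds = [["x", "^", "2"], ["\\frac", "{", "a", "}", "{", "b", "}"], ["a", "-", "b"]]
    print(f"{exact_match_rate(preds, refs):.3f}")
```

A single wrong token makes the whole expression count as a miss, which is why exact match is a strict metric for long expressions.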
Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (beam_size=10), so it is not directly comparable to the greedy numbers above. Our greedy results are directly comparable to the output of our C++ inference engine, which uses greedy left-to-right (L2R) decoding.
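To make the distinction concrete, the following is a generic sketch of greedy left-to-right decoding: at each step the single highest-scoring token is appended, with no beam and no right-to-left pass. It is not the CrispEmbed implementation; `step_fn`, `sos_id`, `eos_id`, `max_len`, and the scripted toy model are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    sos_id: int,
    eos_id: int,
    max_len: int = 200,
) -> List[int]:
    """Append the argmax token at each step until eos_id or max_len is reached."""
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start-of-sequence token


def make_scripted_step(script: List[int], vocab_size: int):
    """Toy stand-in for a decoder: always scores the next scripted token highest."""
    def step(prefix: List[int]) -> List[float]:
        scores = [0.0] * vocab_size
        scores[script[len(prefix) - 1]] = 1.0
        return scores
    return step


if __name__ == "__main__":
    # The script ends with the end-of-sequence id (2), so decoding stops there.
    step = make_scripted_step([5, 7, 2], vocab_size=8)
    print(greedy_decode(step, sos_id=1, eos_id=2))  # [5, 7]
```

Beam search would instead keep several partial hypotheses per step, which is where the gap between the greedy and beam_size=10 figures comes from.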
Usage with CrispEmbed
# Build
cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer
# Run
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-crohme-q8_0.gguf image.bmp
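If the binary needs to be driven from a script, a thin wrapper along these lines may be enough. It mirrors the command above and assumes the recognized LaTeX is written to standard output, which this card does not document; the helper names are hypothetical, and the wrapper prepends to `LD_LIBRARY_PATH` rather than overwriting it.

```python
import os
import subprocess


def build_command(binary, model, image):
    """Argument list equivalent to: ./test-posformer <model.gguf> <image.bmp>"""
    return [str(binary), str(model), str(image)]


def make_env(lib_dir, base_env=None):
    """Copy the environment and put lib_dir at the front of LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}{os.pathsep}{existing}" if existing else str(lib_dir)
    return env


def run_posformer(binary, model, image, lib_dir):
    """Run the test binary and return whatever it prints (output format is an assumption)."""
    result = subprocess.run(
        build_command(binary, model, image),
        env=make_env(lib_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Paths follow the build steps above; adjust them to the actual build directory.
    print(build_command("./test-posformer", "posformer-crohme-q8_0.gguf", "image.bmp"))
```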
Parity
The C++ inference engine matches PyTorch to >99.999% (cosine similarity = 1.000000 at every decoder step). See tests/parity/posformer_*.py in the CrispEmbed repo for verification scripts.
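The kind of check described here can be sketched as follows: compute the cosine similarity between the reference and test activations at each decoder step and report the worst one. This is an illustration only and may differ from what the parity scripts actually do; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def min_step_similarity(reference_steps, test_steps):
    """Worst-case similarity over all decoder steps (one vector per step)."""
    if len(reference_steps) != len(test_steps):
        raise ValueError("both runs must have the same number of steps")
    return min(cosine_similarity(r, t) for r, t in zip(reference_steps, test_steps))


if __name__ == "__main__":
    # Toy logits for two decoder steps; the second run differs by tiny rounding noise.
    reference = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
    candidate = [[0.1, 2.0, -1.0], [1.5, 0.2000001, 0.3]]
    print(f"{min_step_similarity(reference, candidate):.6f}")
```

Reporting the minimum rather than the mean keeps a single diverging step from being averaged away.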
Citation
@inproceedings{guan2024posformer,
  title={PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer},
  author={Guan, Tongkun and others},
  booktitle={ECCV},
  year={2024}
}