HMER Handwritten Math OCR (GGUF)

On-device handwritten mathematical expression recognition. Converts images of handwritten math into LaTeX.

Source

Original model: whywhs/Pytorch-Handwritten-Mathematical-Expression-Recognition (MIT license)
Paper: Zhang et al., "Watch, Attend and Parse: An end-to-end neural network based approach to handwritten mathematical expression recognition", Pattern Recognition 2017
Training data: CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions)
GGUF conversion: CrispStrobe/CrispEmbed (models/convert-hmer-to-gguf.py)
Inference engine: CrispEmbed C++ (src/hmer_ocr.cpp) — DenseNet-121 + GRU attention decoder via ggml
Checkpoint used: encoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl + attn_decoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl

Architecture

Component	Details
Encoder	DenseNet-121 (3 dense blocks: 6+12+24 layers, 2-channel input)
Decoder	2x GRUCell + Bahdanau attention + coverage mechanism
Parameters	6.8M (293 GGUF tensors)
Vocabulary	112 LaTeX tokens
Input	Variable-size grayscale image + padding mask
Output	LaTeX token sequence (greedy decoding)
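
The encoder row above can be cross-checked with a little channel arithmetic. The sketch below assumes the usual DenseNet-121 hyperparameters (64 stem channels, growth rate 32, transition compression 0.5, and a stem that downsamples by 4); none of these values are stated in this card, so treat them as assumptions. Under them, three dense blocks of 6, 12 and 24 layers give the 1024-channel, 16x-downsampled feature map described under "How it works".

```python
def densenet_feature_shape(blocks=(6, 12, 24), stem_channels=64,
                           growth_rate=32, compression=0.5):
    """Return (output_channels, total_downsampling) for a DenseNet-style encoder.

    All default hyperparameters are assumed standard DenseNet-121 values,
    not values read from the converted model.
    """
    channels = stem_channels
    downsample = 4  # assumed: stride-2 stem conv followed by stride-2 pooling
    for i, n_layers in enumerate(blocks):
        # Each dense layer concatenates growth_rate new channels.
        channels += n_layers * growth_rate
        # A transition (1x1 conv + 2x2 average pool) follows every block but the last.
        if i < len(blocks) - 1:
            channels = int(channels * compression)
            downsample *= 2
    return channels, downsample


print(densenet_feature_shape())  # (1024, 16)
```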

Model variants

File	Size	Format	Notes
`hmer-hw-f32.gguf`	26 MB	F32	Full precision, verified parity with PyTorch
`hmer-hw-f16.gguf`	13 MB	F16	Half precision
`hmer-hw-q8_0.gguf`	7 MB	Q8_0	8-bit quantized
`hmer-hw-q4_k.gguf`	4 MB	Q4_K	4-bit quantized, best for mobile
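
The file sizes in the table follow roughly from the parameter count times the storage cost per weight. The sketch below is a back-of-the-envelope estimate only: the bits-per-weight figures for Q8_0 and Q4_K are taken from the ggml block layouts as I understand them (34 bytes per 32 weights and 144 bytes per 256 weights), and it ignores GGUF metadata and any tensors the converter keeps in higher precision.

```python
# Approximate storage cost per weight, in bits.
# Q8_0 / Q4_K values are assumptions based on the ggml block layouts.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 34 * 8 / 32,    # 8.5 bits: 32 int8 values + one fp16 scale
    "Q4_K": 144 * 8 / 256,  # 4.5 bits: 256 4-bit values + block scales
}


def estimate_mib(n_params, fmt):
    """Estimated tensor data size in MiB for n_params weights stored as fmt."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**20


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{estimate_mib(6.8e6, fmt):.1f} MiB")
```

Rounded to whole numbers, the estimates for 6.8M parameters come out at 26, 13, 7 and 4, which matches the table if its "MB" column is read as MiB.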

Supported symbols (112 tokens)

Digits: 0-9
Latin lowercase: a-z
Latin uppercase: A, B, C, E, F, G, H, I, L, M, N, P, R, S, T, V, X, Y
Greek: alpha, beta, gamma, delta, theta, sigma, lambda, mu, pi, phi
Operators: + - = / * ! , .
Relations: < > <= >= != in
Functions: sin cos tan log lim
Structural: frac sqrt sum int ^ _ { } ( ) [ ] | forall exists infty
Other: pm times div cdot prime rightarrow ldots cdots limits

Usage with CrispEmbed

C API

#include <stdio.h>
#include "hmer_ocr.h"

// Load the model and run inference on 4 threads
hmer_ocr_context * ctx = hmer_ocr_init("hmer-hw-f32.gguf", 4);

// pixels: grayscale floats in [0,1], width * height values
int len;
const char * latex = hmer_ocr_recognize(ctx, pixels, width, height, &len);
printf("LaTeX: %s\n", latex);

// Release the model when done
hmer_ocr_free(ctx);

Dart / Flutter

import 'package:crispembed/crispembed.dart';

final ocr = CrispEmbedHmerOcr('hmer-hw-f32.gguf', nThreads: 4);
final latex = ocr.recognizeGray(grayPixels, width, height);
print(latex); // "\frac { x ^ { 2 } + 1 } { 2 }"
ocr.dispose();

Python (via ctypes)

from crispembed import CrispEmbed
ce = CrispEmbed("hmer-hw-f32.gguf")
# Use via C API bindings

How it works

Image preprocessing: Grayscale input normalized to [0,1], with a binary mask channel (1=valid, 0=padded). No fixed resolution required.
DenseNet-121 encoder: 3 dense blocks with bottleneck layers (BN-ReLU-Conv1x1-BN-ReLU-Conv3x3), transition layers (BN-Conv1x1-AvgPool), producing a 1024-channel spatial feature map at 16x downsampling.
GRU attention decoder: Two GRU cells with Bahdanau additive attention and a coverage mechanism. At each step:
- The previous token is embedded and GRU1 turns it into the attention query
- Attention computes a context vector from the encoder features
- A coverage convolution discourages re-attending to the same regions
- GRU2 updates hidden state
- Linear projection produces next-token logits
Greedy decoding: Take the argmax over the 112 LaTeX tokens at each step, stopping at <eol> or after at most 48 steps.
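
The decoder loop above can be sketched in NumPy. This is an illustration under stated assumptions, not the CrispEmbed implementation: the weight names (`W_q`, `W_f`, `w_c`, `v`), the `step_fn` interface and the start-token id are hypothetical, and the coverage term is simplified to a per-position scalar (the running sum of past attention weights) rather than the convolution the model uses.

```python
import numpy as np


def coverage_attention(query, feats, cov_sum, p):
    """One additive (Bahdanau-style) attention step with a simplified coverage term.

    query:   (Q,)   query vector produced by the first GRU
    feats:   (L, D) encoder feature map flattened over its L spatial positions
    cov_sum: (L,)   sum of attention weights from all previous steps
    p:       dict of hypothetical weights W_q (Q, A), W_f (D, A), w_c (A,), v (A,)
    """
    # energy_i = v . tanh(W_q q + W_f f_i + w_c * cov_i)
    hidden = np.tanh(query @ p["W_q"] + feats @ p["W_f"]
                     + np.outer(cov_sum, p["w_c"]))
    energy = hidden @ p["v"]                 # (L,)
    weights = np.exp(energy - energy.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ feats                # (D,) weighted sum of features
    return context, weights


def greedy_decode(step_fn, state, sos_id, eol_id, max_steps=48):
    """Greedy decoding: pick the argmax token until eol_id or max_steps.

    step_fn(prev_token, state) -> (logits, new_state) stands in for the
    embed / GRU1 / attention / GRU2 / projection chain described above.
    """
    tokens = []
    prev = sos_id
    for _ in range(max_steps):
        logits, state = step_fn(prev, state)
        prev = int(np.argmax(logits))
        if prev == eol_id:
            break
        tokens.append(prev)
    return tokens
```

In a real decoder, `step_fn` would call `coverage_attention` with the current query, add the returned weights to `cov_sum`, and carry both GRU hidden states in `state`.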

GGUF conversion

BatchNorm layers are folded at conversion time:

Post-conv BN (stem): folded into conv weight+bias
Pre-activation BN (dense layers, transitions): precomputed as scale+offset
Attention BN (decoder bn1): precomputed as scale+offset

This eliminates all running_mean/running_var tensors from the model.

python models/convert-hmer-to-gguf.py \
    --model-dir /path/to/Pytorch-HMER/model \
    --dict /path/to/Pytorch-HMER/dictionary.txt \
    --output hmer-hw-f32.gguf
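
The two folding strategies described above can be sketched in NumPy. This is a minimal illustration of the standard BatchNorm algebra, not the code in `convert-hmer-to-gguf.py`; the function names are hypothetical and `eps=1e-5` is assumed to match the PyTorch default.

```python
import numpy as np


def bn_scale_offset(gamma, beta, mean, var, eps=1e-5):
    """Reduce inference-mode BatchNorm to y = x * scale + offset (per channel)."""
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset


def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that FOLLOWS a conv into the conv's weight and bias.

    weight: (out_ch, in_ch, kh, kw); bias: (out_ch,) or None.
    """
    scale, offset = bn_scale_offset(gamma, beta, mean, var, eps)
    if bias is None:
        bias = np.zeros(weight.shape[0], dtype=weight.dtype)
    # BN(conv(x)) = (W * x + b) * scale + offset = (W * scale) * x + (b * scale + offset)
    return weight * scale[:, None, None, None], bias * scale + offset
```

A pre-activation BatchNorm sits before the ReLU and the convolution, so it cannot be merged into the following weights; storing the `scale` and `offset` vectors from `bn_scale_offset` is enough, and the running statistics are no longer needed in either case.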

Training data

Trained on CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions). The dataset contains handwritten math expressions with LaTeX ground truth annotations.

Important: image format

The model expects white strokes on black background (CROHME convention). The C++ inference layer handles this automatically:

Auto-inversion: if mean pixel > 0.5, image is inverted (black-on-white → white-on-black)
Auto-scaling: images larger than 100K pixels are scaled down with bilinear interpolation (e.g. 4000×3000 camera photo → 365×273)
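
For callers who preprocess images themselves, the two rules above can be approximated as follows. This is a sketch, not the C++ code: the function names are hypothetical, and truncating the scaled dimensions toward zero is an assumption chosen because it reproduces the 365×273 example. The bilinear resampling itself is left to an image library.

```python
import math

import numpy as np

MAX_PIXELS = 100_000  # threshold quoted above


def target_size(width, height, max_pixels=MAX_PIXELS):
    """Return the (width, height) to resample to so the area fits max_pixels."""
    n_pixels = width * height
    if n_pixels <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    s = math.sqrt(max_pixels / n_pixels)
    return max(1, int(width * s)), max(1, int(height * s))  # truncation is assumed


def auto_invert(gray):
    """Invert a [0,1] grayscale image if it looks like dark ink on a light page."""
    gray = np.asarray(gray, dtype=np.float32)
    return 1.0 - gray if gray.mean() > 0.5 else gray


print(target_size(4000, 3000))  # (365, 273)
```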

Accuracy

Evaluated on samples from the CROHME 2016 offline test set (986 images in total):

Exact match: ~58% (11 of 19) on a 19-sample subset
Bit-exact parity with the original PyTorch implementation verified (F32 model)
Common errors: confusion between similar-looking symbols (p/beta, 1/2 in subscripts, geq/z)

License

MIT (same as the original Pytorch-HMER repository).

Citation

If you use this model, please cite the original WAP paper:

@article{zhang2017watch,
  title={Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition},
  author={Zhang, Jianshu and Du, Jun and Zhang, Shiliang and Liu, Dan and Hu, Yulong and Hu, Jinshui and Wei, Si and Dai, Lirong},
  journal={Pattern Recognition},
  year={2017}
}
