HMER Handwritten Math OCR (GGUF)
On-device handwritten mathematical expression recognition. Converts images of handwritten math into LaTeX.
Source
- Original model: whywhs/Pytorch-Handwritten-Mathematical-Expression-Recognition (MIT license)
- Paper: Zhang et al., "Watch, Attend and Parse: An end-to-end neural network based approach to handwritten mathematical expression recognition", Pattern Recognition 2017
- Training data: CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions)
- GGUF conversion: CrispStrobe/CrispEmbed (models/convert-hmer-to-gguf.py)
- Inference engine: CrispEmbed C++ (src/hmer_ocr.cpp), DenseNet-121 + GRU attention decoder via ggml
- Checkpoint used: encoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl + attn_decoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl
Architecture
| Component | Details |
|---|---|
| Encoder | DenseNet-121 (3 dense blocks: 6+12+24 layers, 2-channel input) |
| Decoder | 2x GRUCell + Bahdanau attention + coverage mechanism |
| Parameters | 6.8M (293 GGUF tensors) |
| Vocabulary | 112 LaTeX tokens |
| Input | Variable-size grayscale image + padding mask |
| Output | LaTeX token sequence (greedy decoding) |
Model variants
| File | Size | Format | Notes |
|---|---|---|---|
| hmer-hw-f32.gguf | 26 MB | F32 | Full precision, verified parity with PyTorch |
| hmer-hw-f16.gguf | 13 MB | F16 | Half precision |
| hmer-hw-q8_0.gguf | 7 MB | Q8_0 | 8-bit quantized |
| hmer-hw-q4_k.gguf | 4 MB | Q4_K | 4-bit quantized, best for mobile |
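The listed sizes are roughly what the parameter count predicts. The sketch below is a back-of-the-envelope estimate, assuming all 6.8M weights are stored in the named format (real GGUF files typically keep some small tensors at higher precision and add metadata). It uses ggml's block layouts: Q8_0 packs 32 weights into 34 bytes and Q4_K packs 256 weights into 144 bytes.

```python
# Estimated tensor payload per format, in MiB.
# Assumption: every one of the 6.8M parameters uses the same storage type.
N_PARAMS = 6_800_000

BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 34 * 8 / 32,    # 32 int8 values + one fp16 scale per block
    "Q4_K": 144 * 8 / 256,  # 256 4-bit values + scales/mins per super-block
}


def estimated_mib(n_params: int, fmt: str) -> float:
    """Return the approximate size in MiB for n_params weights in format fmt."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / (1024 * 1024)


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {estimated_mib(N_PARAMS, fmt):.1f} MiB")
```

Rounded to whole units, the estimates line up with the 26 / 13 / 7 / 4 MB figures in the table.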
Supported symbols (112 tokens)
- Digits: 0-9
- Latin lowercase: a-z
- Latin uppercase: A, B, C, E, F, G, H, I, L, M, N, P, R, S, T, V, X, Y
- Greek: alpha, beta, gamma, delta, theta, sigma, lambda, mu, pi, phi
- Operators: + - = / * ! , .
- Relations: < > <= >= != in
- Functions: sin cos tan log lim
- Structural: frac sqrt sum int ^ _ { } ( ) [ ] | forall exists infty
- Other: pm times div cdot prime rightarrow ldots cdots limits
Usage with CrispEmbed
C API
#include <stdio.h>
#include "hmer_ocr.h"
// Load the model; the second argument is the thread count.
hmer_ocr_context * ctx = hmer_ocr_init("hmer-hw-f32.gguf", 4);
// pixels: grayscale float values in [0,1], width * height entries
int len = 0;
const char * latex = hmer_ocr_recognize(ctx, pixels, width, height, &len);
printf("LaTeX: %s\n", latex);
hmer_ocr_free(ctx);
Dart / Flutter
import 'package:crispembed/crispembed.dart';
final ocr = CrispEmbedHmerOcr('hmer-hw-f32.gguf', nThreads: 4);
final latex = ocr.recognizeGray(grayPixels, width, height);
print(latex); // "\frac { x ^ { 2 } + 1 } { 2 }"
ocr.dispose();
Python (via ctypes)
from crispembed import CrispEmbed
ce = CrispEmbed("hmer-hw-f32.gguf")
# Use via C API bindings
How it works
Image preprocessing: Grayscale input normalized to [0,1], with a binary mask channel (1=valid, 0=padded). No fixed resolution required.
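To illustrate the two-channel input, the sketch below normalizes an 8-bit grayscale image and pads it onto a larger canvas while building the matching mask. The function name `make_input`, the 8-bit source format, and zero padding are assumptions made for this example; this is not the CrispEmbed API.

```python
def make_input(gray_u8, pad_h, pad_w):
    """Return (image, mask) as pad_h x pad_w nested lists.

    gray_u8: rows of 8-bit grayscale values (0..255).
    image:   pixels scaled to [0, 1], zero in the padded area.
    mask:    1.0 where a real pixel exists, 0.0 where padded.
    """
    h, w = len(gray_u8), len(gray_u8[0])
    if pad_h < h or pad_w < w:
        raise ValueError("padded size must not be smaller than the image")
    image = [[0.0] * pad_w for _ in range(pad_h)]
    mask = [[0.0] * pad_w for _ in range(pad_h)]
    for y in range(h):
        for x in range(w):
            image[y][x] = gray_u8[y][x] / 255.0
            mask[y][x] = 1.0
    return image, mask
```

For a single image with no padding, the mask is simply all ones; the mask matters when images of different sizes share one canvas.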
DenseNet-121 encoder: 3 dense blocks with bottleneck layers (BN-ReLU-Conv1x1-BN-ReLU-Conv3x3), transition layers (BN-Conv1x1-AvgPool), producing a 1024-channel spatial feature map at 16x downsampling.
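The 1024-channel, 16x figures follow from simple bookkeeping. The sketch below assumes the standard DenseNet-121 hyperparameters (64 stem channels, growth rate 32, a stem that downsamples 4x, and transitions that halve both channels and resolution); these values are not stated in this card, but they reproduce the numbers quoted above.

```python
def densenet_feature_shape(block_layers=(6, 12, 24), growth_rate=32,
                           init_features=64, compression=0.5):
    """Return (channels, downsampling) at the encoder output.

    Assumes a stride-2 stem conv followed by a stride-2 pool (4x total)
    and a transition layer between consecutive dense blocks only.
    """
    channels = init_features
    stride = 4
    for i, n_layers in enumerate(block_layers):
        # Each dense layer concatenates growth_rate new channels.
        channels += n_layers * growth_rate
        if i < len(block_layers) - 1:
            # Transition: 1x1 conv compresses channels, AvgPool halves H and W.
            channels = int(channels * compression)
            stride *= 2
    return channels, stride


if __name__ == "__main__":
    print(densenet_feature_shape())
```

With three blocks the channel count runs 64 → 256 → 128 → 512 → 256 → 1024, and the two transitions take the stride from 4 to 16.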
GRU attention decoder: Two GRU cells with Bahdanau additive attention and a coverage mechanism. At each step:
- The previous token is embedded and GRU1 produces the attention query
- Attention computes a context vector from the encoder features
- A coverage convolution over past attention weights discourages re-attending to the same regions
- GRU2 updates the hidden state from the context vector
- A linear projection produces the next-token logits
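In the notation of the WAP paper, one attention step with coverage can be written as below. The symbols are chosen here for illustration: q_t is the query from GRU1, a_i is the encoder feature at spatial position i, β_t is the sum of all past attention maps, Q is the coverage convolution filter, and c_t is the context vector passed to GRU2. The exact layer shapes in this checkpoint, and where its attention BatchNorm sits, are not spelled out in this card.

```latex
\begin{aligned}
\boldsymbol{\beta}_t &= \sum_{l=1}^{t-1} \boldsymbol{\alpha}_l, \qquad
\mathbf{F}_t = \mathbf{Q} * \boldsymbol{\beta}_t \\
e_{ti} &= \boldsymbol{\nu}^{\top} \tanh\!\left(\mathbf{W}_a \mathbf{q}_t
          + \mathbf{U}_a \mathbf{a}_i + \mathbf{U}_f \mathbf{f}_{ti}\right) \\
\alpha_{ti} &= \frac{\exp(e_{ti})}{\sum_{k} \exp(e_{tk})}, \qquad
\mathbf{c}_t = \sum_{i} \alpha_{ti}\, \mathbf{a}_i
\end{aligned}
```

Here f_{ti} is the column of F_t at position i. Because β_t grows wherever attention has already been spent, the score e_{ti} can learn to down-weight regions that were parsed earlier.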
Greedy decoding: argmax over the 112 LaTeX tokens at each step, stopping at <eol> or after a maximum of 48 steps.
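The decoding loop itself is short. The sketch below is a generic greedy decoder, not the CrispEmbed implementation: `step` is a hypothetical callable standing in for one decoder step, and `sos_id` (a start token) is an assumption, since this card does not say how the first step is seeded.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_decode(
    step: Callable[[int, object], Tuple[Sequence[float], object]],
    init_state: object,
    sos_id: int,
    eol_id: int,
    max_steps: int = 48,
) -> List[int]:
    """Pick the highest-scoring token each step until <eol> or max_steps."""
    tokens: List[int] = []
    prev, state = sos_id, init_state
    for _ in range(max_steps):
        logits, state = step(prev, state)
        prev = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        if prev == eol_id:
            break
        tokens.append(prev)
    return tokens


def toy_step(prev, state):
    """Toy stand-in for the decoder: emits tokens 3, 1, then <eol> (id 0)."""
    script = [3, 1, 0]
    target = script[min(state, len(script) - 1)]
    logits = [1.0 if i == target else 0.0 for i in range(4)]
    return logits, state + 1


if __name__ == "__main__":
    print(greedy_decode(toy_step, 0, sos_id=2, eol_id=0))
```

The step cap matters for inputs where the model never emits <eol>: the loop then returns whatever 48 tokens it has produced rather than running indefinitely.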
GGUF conversion
BatchNorm layers are folded at conversion time:
- Post-conv BN (stem): folded into conv weight+bias
- Pre-activation BN (dense layers, transitions): precomputed as scale+offset
- Attention BN (decoder bn1): precomputed as scale+offset
This eliminates all running_mean/running_var tensors from the model.
python models/convert-hmer-to-gguf.py \
--model-dir /path/to/Pytorch-HMER/model \
--dict /path/to/Pytorch-HMER/dictionary.txt \
--output hmer-hw-f32.gguf
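The two kinds of folding described above come down to the same per-channel arithmetic. The sketch below is a minimal pure-Python illustration, not code from the conversion script; the function names are hypothetical and `eps=1e-5` is assumed because it is the PyTorch BatchNorm default.

```python
import math


def bn_scale_offset(gamma, beta, mean, var, eps=1e-5):
    """Per-channel scale/offset so that BN(x) == scale * x + offset."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    offset = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, offset


def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN that follows a conv into the conv's weight and bias.

    weight: one flat list of kernel values per output channel.
    bias:   one value per output channel (use zeros if the conv has none).
    """
    scale, offset = bn_scale_offset(gamma, beta, mean, var, eps)
    new_weight = [[w * s for w in row] for row, s in zip(weight, scale)]
    new_bias = [b * s + o for b, s, o in zip(bias, scale, offset)]
    return new_weight, new_bias
```

A BN that follows a conv can be absorbed completely, because scaling the conv output is the same as scaling its kernel. A pre-activation BN sits before a ReLU, so it cannot be merged into the next conv and is kept as an explicit scale and offset instead.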
Training data
Trained on CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions). The dataset contains handwritten math expressions with LaTeX ground truth annotations.
Important: image format
The model expects white strokes on black background (CROHME convention). The C++ inference layer handles this automatically:
- Auto-inversion: if mean pixel > 0.5, image is inverted (black-on-white → white-on-black)
- Auto-scaling: images larger than 100K pixels are scaled down with bilinear interpolation (e.g. 4000×3000 camera photo → 365×273)
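The two rules above can be sketched as follows. This is an illustration rather than the C++ code: the function names are hypothetical, and truncating the scaled dimensions is an assumption chosen because it reproduces the 365×273 example. The bilinear resampling itself is not shown.

```python
import math

MAX_PIXELS = 100_000  # threshold quoted above


def maybe_invert(pixels):
    """Invert a flat list of [0,1] pixels if the image is mostly bright."""
    mean = sum(pixels) / len(pixels)
    return [1.0 - p for p in pixels] if mean > 0.5 else list(pixels)


def target_size(width, height, max_pixels=MAX_PIXELS):
    """Return the aspect-preserving size that fits within max_pixels."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    print(target_size(4000, 3000))
```

Scaling both sides by the square root of the pixel ratio keeps the aspect ratio while bringing the total pixel count just under the limit.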
Accuracy
Tested on images from the CROHME 2016 offline test set (986 images in total):
- Exact match: ~58%, measured on a 19-sample subset only, so the figure carries wide uncertainty
- Bit-exact parity with the original PyTorch implementation verified
- Common errors: confusing similar symbols (p/beta, 1/2 in subscripts, geq/z)
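This card does not say how exact match is computed. One plausible definition, sketched below as an assumption, compares whitespace-separated LaTeX tokens so that spacing differences do not count as errors. The function name and the toy inputs are illustrative only.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions whose token sequence equals the reference.

    Strings are split on whitespace, so only token content and order matter.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["x ^ { 2 }", "\\beta"]
    refs = ["x ^ {  2 }", "p"]
    print(exact_match_rate(preds, refs))
```

Under this definition a single confused symbol, such as beta for p, makes the whole expression count as wrong, which is why exact match is a strict metric.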
License
MIT (same as the original Pytorch-HMER repository).
Citation
If you use this model, please cite the original WAP paper:
@article{zhang2017watch,
title={Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition},
author={Zhang, Jianshu and Du, Jun and Zhang, Shiliang and Liu, Dan and Hu, Yulong and Hu, Jinshui and Wei, Si and Dai, Lirong},
journal={Pattern Recognition},
year={2017}
}