HMER Handwritten Math OCR (GGUF)

On-device handwritten mathematical expression recognition. Converts images of handwritten math into LaTeX.

Source

  • Original model: whywhs/Pytorch-Handwritten-Mathematical-Expression-Recognition (MIT license)
  • Paper: Zhang et al., "Watch, Attend and Parse: An end-to-end neural network based approach to handwritten mathematical expression recognition", Pattern Recognition 2017
  • Training data: CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions)
  • GGUF conversion: CrispStrobe/CrispEmbed (models/convert-hmer-to-gguf.py)
  • Inference engine: CrispEmbed C++ (src/hmer_ocr.cpp), running a DenseNet-121 encoder and GRU attention decoder via ggml
  • Checkpoint used: encoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl + attn_decoder_lr0.00001_GN_te1_d05_SGD_bs6_mask_conv_bn_b_xavier.pkl

Architecture

| Component  | Details                                                          |
|------------|------------------------------------------------------------------|
| Encoder    | DenseNet-121 (3 dense blocks: 6+12+24 layers, 2-channel input)   |
| Decoder    | 2x GRUCell + Bahdanau attention + coverage mechanism             |
| Parameters | 6.8M (293 GGUF tensors)                                          |
| Vocabulary | 112 LaTeX tokens                                                 |
| Input      | Variable-size grayscale image + padding mask                     |
| Output     | LaTeX token sequence (greedy decoding)                           |

Model variants

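| File              | Size  | Format | Notes                                       |
|-------------------|-------|--------|---------------------------------------------|
| hmer-hw-f32.gguf  | 26 MB | F32    | Full precision, verified parity with PyTorch |
| hmer-hw-f16.gguf  | 13 MB | F16    | Half precision                              |
| hmer-hw-q8_0.gguf | 7 MB  | Q8_0   | 8-bit quantized                             |
| hmer-hw-q4_k.gguf | 4 MB  | Q4_K   | 4-bit quantized, best for mobile            |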
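
The file sizes above follow roughly from the parameter count and the storage cost per weight. The sketch below is a rough estimate only: it assumes every tensor is stored in the named format and ignores GGUF metadata, whereas a real file may keep some tensors in F32. The bits-per-weight values come from ggml's block layouts (Q8_0: 34 bytes per 32 weights; Q4_K: 144 bytes per 256 weights). `estimate_size_mib` is a hypothetical helper, not part of CrispEmbed.

```python
# Bits per weight for each storage format (ggml block layouts).
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5, "Q4_K": 4.5}


def estimate_size_mib(n_params: float, fmt: str) -> float:
    """Approximate tensor payload in MiB, ignoring metadata and mixed precision."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**20


if __name__ == "__main__":
    # 6.8M parameters, as listed in the architecture table.
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {estimate_size_mib(6.8e6, fmt):.1f} MiB")
```

Rounded to whole mebibytes, the estimates for 6.8M parameters are 26, 13, 7 and 4, which agrees with the listed sizes.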

Supported symbols (112 tokens)

  • Digits: 0-9
  • Latin lowercase: a-z
  • Latin uppercase: A, B, C, E, F, G, H, I, L, M, N, P, R, S, T, V, X, Y
  • Greek: alpha, beta, gamma, delta, theta, sigma, lambda, mu, pi, phi
  • Operators: + - = / * ! , .
  • Relations: < > <= >= != in
  • Functions: sin cos tan log lim
  • Structural: frac sqrt sum int ^ _ { } ( ) [ ] | forall exists infty
  • Other: pm times div cdot prime rightarrow ldots cdots limits

Usage with CrispEmbed

C API

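#include <stdio.h>
#include "hmer_ocr.h"

// Load the model; the second argument is the thread count.
hmer_ocr_context * ctx = hmer_ocr_init("hmer-hw-f32.gguf", 4);

// pixels: width * height grayscale floats in [0,1]
int len = 0;
const char * latex = hmer_ocr_recognize(ctx, pixels, width, height, &len);
if (latex != NULL) {  // defensive check before printing
    printf("LaTeX: %s\n", latex);
}

hmer_ocr_free(ctx);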

Dart / Flutter

import 'package:crispembed/crispembed.dart';

final ocr = CrispEmbedHmerOcr('hmer-hw-f32.gguf', nThreads: 4);
final latex = ocr.recognizeGray(grayPixels, width, height);
print(latex); // "\frac { x ^ { 2 } + 1 } { 2 }"
ocr.dispose();

Python (via ctypes)

from crispembed import CrispEmbed
ce = CrispEmbed("hmer-hw-f32.gguf")
# Use via C API bindings
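
The recognizer returns space-separated LaTeX tokens, as in the Dart example's `\frac { x ^ { 2 } + 1 } { 2 }`. For display it can help to remove the extra spaces. This is a minimal post-processing sketch, assuming the tokens are separated by whitespace. `compact_latex` is a hypothetical helper, not part of the crispembed package.

```python
def compact_latex(tokenized: str) -> str:
    """Join space-separated LaTeX tokens into a compact string.

    A space is kept after a control word (e.g. \\sin) when the next token
    starts with a letter, so that "\\sin x" does not become "\\sinx".
    """
    tokens = tokenized.split()
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        is_control_word = tok.startswith("\\") and tok[1:].isalpha()
        if is_control_word and nxt[:1].isalpha():
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    print(compact_latex(r"\frac { x ^ { 2 } + 1 } { 2 }"))
```

The example prints `\frac{x^{2}+1}{2}`.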

How it works

  1. Image preprocessing: Grayscale input normalized to [0,1], with a binary mask channel (1=valid, 0=padded). No fixed resolution required.

  2. DenseNet-121 encoder: 3 dense blocks with bottleneck layers (BN-ReLU-Conv1x1-BN-ReLU-Conv3x3), transition layers (BN-Conv1x1-AvgPool), producing a 1024-channel spatial feature map at 16x downsampling.

  3. GRU attention decoder: Two GRU cells with Bahdanau additive attention and a coverage mechanism. At each step:

    • The previous token is embedded and GRU1 produces the attention query
    • Attention computes a context vector from the encoder features
    • A coverage convolution over past attention weights discourages re-attending to the same regions
    • GRU2 updates hidden state
    • Linear projection produces next-token logits
  4. Greedy decoding: Argmax over the 112 LaTeX tokens at each step, stopping at <eol> or after a maximum of 48 steps.
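
The greedy decoding loop in step 4 can be sketched as follows. This is not the code in src/hmer_ocr.cpp: `step`, `state` and `start_id` are hypothetical stand-ins for the decoder step (GRU1, attention, coverage, GRU2, projection), its hidden state and the initial token id. Only the `<eol>` stop token and the 48-step limit are taken from the description above.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical signature: (previous token id, state) -> (logits, new state)
StepFn = Callable[[int, Any], Tuple[Sequence[float], Any]]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(step: StepFn, state: Any, vocab: List[str],
                  start_id: int, eol: str = "<eol>",
                  max_steps: int = 48) -> str:
    """Pick the highest-scoring token at each step until <eol> or max_steps."""
    tokens: List[str] = []
    prev = start_id
    for _ in range(max_steps):
        logits, state = step(prev, state)
        prev = argmax(logits)
        if vocab[prev] == eol:
            break
        tokens.append(vocab[prev])
    return " ".join(tokens)


if __name__ == "__main__":
    # Toy decoder that replays a fixed token script, for illustration only.
    vocab = ["<eol>", "x", "^", "2"]
    script = [1, 2, 3, 0]

    def toy_step(prev: int, state: int):
        logits = [0.0] * len(vocab)
        logits[script[state]] = 1.0
        return logits, state + 1

    print(greedy_decode(toy_step, 0, vocab, start_id=0))
```

Greedy decoding keeps only one hypothesis per step, so it is cheap on-device but cannot recover from an early wrong token the way a beam search could.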

GGUF conversion

BatchNorm layers are folded at conversion time:

  • Post-conv BN (stem): folded into conv weight+bias
  • Pre-activation BN (dense layers, transitions): precomputed as scale+offset
  • Attention BN (decoder bn1): precomputed as scale+offset

This eliminates all running_mean/running_var tensors from the model.

python models/convert-hmer-to-gguf.py \
    --model-dir /path/to/Pytorch-HMER/model \
    --dict /path/to/Pytorch-HMER/dictionary.txt \
    --output hmer-hw-f32.gguf
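
The BatchNorm folding described above can be illustrated with plain arithmetic. This is a sketch of the general technique, not an excerpt from convert-hmer-to-gguf.py. The function names are hypothetical and `eps=1e-5` (the PyTorch default) is an assumption. In inference mode BatchNorm computes `y = gamma * (x - mean) / sqrt(var + eps) + beta`, which reduces to `y = scale * x + offset` per channel.

```python
import math


def bn_scale_offset(gamma, beta, mean, var, eps=1e-5):
    """Reduce one channel's inference-mode BatchNorm to y = scale * x + offset."""
    scale = gamma / math.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset


def fold_bn_into_conv(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a post-conv BatchNorm into the conv weights and bias.

    weights: one flat list of kernel weights per output channel.
    bias, gamma, beta, mean, var: one value per output channel.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, b, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale, offset = bn_scale_offset(g, b, m, v, eps)
        folded_w.append([w * scale for w in w_c])
        folded_b.append(b_c * scale + offset)
    return folded_w, folded_b
```

For a pre-activation BatchNorm, which sits before the convolution, only the `scale` and `offset` pair is stored and applied to the input, since the ReLU between BN and conv prevents merging them into the kernel.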

Training data

Trained on CROHME 2016 (Competition on Recognition of Online Handwritten Mathematical Expressions). The dataset contains handwritten math expressions with LaTeX ground truth annotations.

Important: image format

The model expects white strokes on black background (CROHME convention). The C++ inference layer handles this automatically:

  • Auto-inversion: if mean pixel > 0.5, image is inverted (black-on-white → white-on-black)
  • Auto-scaling: images larger than 100K pixels are scaled down with bilinear interpolation (e.g. 4000×3000 camera photo → 365×273)
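
The two rules above can be sketched in a few lines. This is an illustration, not the logic in src/hmer_ocr.cpp: the helper names are hypothetical, and truncating the scaled dimensions is an assumption chosen because it reproduces the 365×273 example. The bilinear resampling itself is omitted.

```python
import math

MAX_PIXELS = 100_000  # size limit stated above


def maybe_invert(pixels):
    """Invert a flat list of [0,1] grayscale values if the image is mostly bright."""
    mean = sum(pixels) / len(pixels)
    if mean > 0.5:
        return [1.0 - p for p in pixels]
    return list(pixels)


def target_size(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) after an aspect-preserving downscale to max_pixels."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Truncation is an assumption; it matches the 4000x3000 -> 365x273 example.
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    print(target_size(4000, 3000))
```

The scale factor is the square root of the pixel ratio because both dimensions shrink by the same factor, so the area shrinks by its square.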

Accuracy

Evaluated on a 19-sample subset of the CROHME 2016 offline test set (986 images in total):

  • Exact match: ~58%
  • Bit-exact parity with the original PyTorch implementation verified
  • Common errors: confusing similar symbols (p/beta, 1/2 in subscripts, geq/z)
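
An exact-match rate can be computed as below. The evaluation script behind the figure above is not shown in this document, so this is one common definition rather than the method actually used: a prediction counts as correct only if its token sequence equals the ground truth after whitespace normalization. `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions whose token sequence equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["x ^ 2", "a + b"]
    refs = ["x ^ 2", "a - b"]
    print(exact_match_rate(preds, refs))
```

With only 19 samples, one additional correct or incorrect prediction moves the rate by about 5 percentage points.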

License

MIT (same as the original Pytorch-HMER repository).

Citation

If you use this model, please cite the original WAP paper:

@article{zhang2017watch,
  title={Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition},
  author={Zhang, Jianshu and Du, Jun and Zhang, Shiliang and Liu, Dan and Hu, Yulong and Hu, Jinshui and Wei, Si and Dai, Lirong},
  journal={Pattern Recognition},
  year={2017}
}