OCR using official space is poor. User error or space error?

#3
by OJ-1 - opened

Using this exact image, I've tested a lot of different OCR models: paddle v1.5 OCR (and smaller) misses the numbers in the bottom left, whilst Z.ai OCR misses the "Loom". (Your paper needs to benchmark these models.)

Using your Space with either long or base, n-gram enabled, and the prompt "document parsing." or "comic parsing", I get this output:

Good thing I sneaked in this early. Now I can finally secure a spot for the new "Fluffy Across the Blue" merch!
![](images/0.jpg)

![](images/1.jpg)

Can you please advise? Is this a user issue, and if so, how should I approach it?

OCR-test


Reproduced this on my end (Q8_0 GGUF via llama.cpp). Same result: on the full image it only reads the top panel and then starts repeating itself. Not user error, it's how it handles a tall, dense page.

A few things that helped me...

  • The card lists two configs, gundam (base_size=1024, image_size=640, crop_mode=True) and base (1024/1024, no crop). gundam crops the page internally, which is what a tall multi-panel image needs. If the Space is running base mode, a 1580x3070 page gets squashed and the small text is lost. Try gundam if you can select it (crop_mode=True, image_size=640).

  • On a GGUF/llama.cpp build you can't toggle gundam, so I just split the page into 4 horizontal strips and ran each one. Got roughly 5x more text out (about 1/12 of the lines on the whole image, about 5/12 tiled). No model change, just slicing the image client side.

  • The repetition is because the official code uses a no_repeat_ngram_size=35 logit processor that some runtimes skip. Add a frequency/repetition penalty and cap max_tokens, and it stops looping.
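
To make the repetition point concrete, here is a minimal sketch of the general no-repeat n-gram idea in plain Python. It is not the official processor: the function name `banned_next_tokens` is hypothetical, and the logic is only the textbook rule (ban any next token that would complete an n-gram already present in the output). A runtime that skips this step has nothing stopping it from emitting the same long span again.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the token ids that would complete an n-gram already in `tokens`.

    Illustration only (hypothetical helper, not the official code): a
    no-repeat n-gram processor would set the logits of these ids to -inf
    before sampling the next token.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be >= 1")
    if len(tokens) < ngram_size:
        return set()  # no complete n-gram has been generated yet

    prefix_len = ngram_size - 1
    # The last (n - 1) tokens are the prefix the next token would extend.
    prefix = tuple(tokens[len(tokens) - prefix_len:])

    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        # If an earlier window matches the current prefix, the token that
        # followed it back then must not be generated again now.
        if tuple(tokens[i:i + prefix_len]) == prefix:
            banned.add(tokens[i + prefix_len])
    return banned


if __name__ == "__main__":
    history = [1, 2, 3, 1, 2]
    print(banned_next_tokens(history, 3))
```

With a size of 35, as in the setting mentioned above, the rule only kicks in once a 34-token prefix recurs, so it blocks long loops without affecting short repeated phrases.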

Prompt: use "document parsing." The "comic parsing" prompt gave me the worst output by far.

Even tiled, the faint SFX (LOOM) and the tiny corner numbers stayed unreadable, probably a real limit on a low contrast page (the other OCRs you tried miss the same bits). But most of the dialogue comes back once you crop or tile instead of feeding the whole page.
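
For anyone who wants to try the tiling, here is a minimal sketch of the client-side split, assuming Pillow is installed. The names `strip_boxes` and `split_image` are hypothetical, and the `overlap` parameter is an extra assumption (the test above used plain strips); a small overlap may help when a speech bubble straddles a cut.

```python
def strip_boxes(width, height, n_strips=4, overlap=0):
    """Return (left, upper, right, lower) boxes for full-width strips.

    The strips are stacked top to bottom. `overlap` (in pixels) extends
    each strip into its neighbours and is clamped to the image bounds.
    """
    if n_strips < 1:
        raise ValueError("n_strips must be >= 1")
    boxes = []
    for i in range(n_strips):
        top = max(0, i * height // n_strips - overlap)
        bottom = min(height, (i + 1) * height // n_strips + overlap)
        boxes.append((0, top, width, bottom))
    return boxes


def split_image(path, n_strips=4, overlap=0):
    """Open an image and return a list of cropped strips (requires Pillow)."""
    from PIL import Image  # imported here so strip_boxes works without Pillow

    img = Image.open(path)
    width, height = img.size
    return [img.crop(box) for box in strip_boxes(width, height, n_strips, overlap)]


if __name__ == "__main__":
    # Boxes for the 1580x3070 page discussed above, split into 4 strips.
    for box in strip_boxes(1580, 3070, n_strips=4):
        print(box)
```

Each strip is then sent to the model as its own request, and the outputs are concatenated in order. If an overlap is used, lines that fall in the overlap can show up twice and need de-duplicating.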

It's terrible though!

Tested Unlimited-OCR against olmOCR-2, Qianfan-OCR, PaddleOCR-VL and HunyuanOCR on a hard, low res scan of a dense legal form. Same image and settings.

Fastest of the five, but the least accurate. Where it couldn't read the text, it didn't leave a blank; it invented content. It got the title and structure, then wrote a confident passage about a board of directors and voting that isn't in the document.

olmOCR-2 and Qianfan-OCR stayed faithful. PaddleOCR-VL and HunyuanOCR garbled words but stayed close. Faithfulness order: olmOCR-2 > Qianfan-OCR > PaddleOCR-VL > HunyuanOCR > Unlimited-OCR.

Speed is great and it's fine on clean pages. But hallucinating plausible text instead of flagging unreadable input is the dangerous failure for legal, medical or finance work. One document, not a benchmark, but it matches the tall comic thread: when it can't read the pixels, it invents.

You can't select gundam from the Space, so given your testing and my results we can conclude it's both a Space issue and, as an aside, a model that isn't great.
