How exactly is this model used?

by Aloooom - opened 3 days ago

I have served it on VLLM as instructed, but both with open-webui and direct python script, I couldn't successfully parse a document.
open-webui would generate straight gibberish.
direct python script wouldn't work for pdfs so I had to convert each pdf page to images and feed it to the model, but even with that, the model generated very weird repeated texts, which are a total nonsense.

I would appreciate a script or workflow of running this model being provided.

Thanks in advance.

Haigini

KoreaDeepLearning org about 24 hours ago

Hi, thanks for trying it out.

The behaviour you saw is expected: this is not a single-shot end-to-end model and it does not take PDFs or free-form chat prompts. A generic prompt in open-webui will produce gibberish, and a full page sent with the wrong prompt will loop/repeat.

I've added a Usage section to the model card. The key points:

It runs as a pipeline: detect layout -> crop each region -> call the model again per region with a fixed, task-specific prompt (Layout / Text / Table / Formula / Figure). The prompts are listed in the card.
Feed page images, one per request.
Set enable_thinking=False in the chat template and decode with skip_special_tokens=False. The repeated-nonsense output is almost always one of these two being unset.

Please follow the prompts and flags in the card rather than a chat UI.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment