How exactly is this model used?
I have served it on VLLM as instructed, but both with open-webui and direct python script, I couldn't successfully parse a document.
open-webui would generate straight gibberish.
direct python script wouldn't work for pdfs so I had to convert each pdf page to images and feed it to the model, but even with that, the model generated very weird repeated texts, which are a total nonsense.
I would appreciate a script or workflow of running this model being provided.
Thanks in advance.
Hi, thanks for trying it out.
The behaviour you saw is expected: this is not a single-shot end-to-end model and it does not take PDFs or free-form chat prompts. A generic prompt in open-webui will produce gibberish, and a full page sent with the wrong prompt will loop/repeat.
I've added a Usage section to the model card. The key points:
- It runs as a pipeline: detect layout -> crop each region -> call the model again per region with a fixed, task-specific prompt (Layout / Text / Table / Formula / Figure). The prompts are listed in the card.
- Feed page images, one per request.
- Set
enable_thinking=Falsein the chat template and decode withskip_special_tokens=False. The repeated-nonsense output is almost always one of these two being unset.
Please follow the prompts and flags in the card rather than a chat UI.