How exactly is this model used?

#1
by Aloooom - opened

I have served it on VLLM as instructed, but both with open-webui and direct python script, I couldn't successfully parse a document.
open-webui would generate straight gibberish.
direct python script wouldn't work for pdfs so I had to convert each pdf page to images and feed it to the model, but even with that, the model generated very weird repeated texts, which are a total nonsense.

I would appreciate a script or workflow of running this model being provided.

Thanks in advance.

KoreaDeepLearning org

Hi, thanks for trying it out.

The behaviour you saw is expected: this is not a single-shot end-to-end model and it does not take PDFs or free-form chat prompts. A generic prompt in open-webui will produce gibberish, and a full page sent with the wrong prompt will loop/repeat.

I've added a Usage section to the model card. The key points:

  • It runs as a pipeline: detect layout -> crop each region -> call the model again per region with a fixed, task-specific prompt (Layout / Text / Table / Formula / Figure). The prompts are listed in the card.
  • Feed page images, one per request.
  • Set enable_thinking=False in the chat template and decode with skip_special_tokens=False. The repeated-nonsense output is almost always one of these two being unset.

Please follow the prompts and flags in the card rather than a chat UI.

Sign up or log in to comment