donut-docai: Korean Transaction-Statement Parser
A fine-tuned Donut model (naver-clova-ix/donut-base) that reads an image of a Korean transaction statement (거래명세표 / 계산서) and outputs structured JSON, with no separate OCR step or rule engine.
Code & full pipeline: https://github.com/KyoungsoonKim00/donut-document-ai
Usage
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load the fine-tuned processor and model from the Hub.
processor = DonutProcessor.from_pretrained("ksk00/donut-docai")
model = VisionEncoderDecoderModel.from_pretrained("ksk00/donut-docai")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Preprocess the document image into pixel values.
image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Decoding starts from the task prompt token.
decoder_input_ids = processor.tokenizer(
    "<s_gt_parse>", return_tensors="pt", add_special_tokens=False
).input_ids.to(device)

# Beam-search generation; gradients are not needed at inference time.
with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=512,
        num_beams=5,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
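The decoded string is a flat token sequence, not yet a JSON object. The upstream Donut examples decode without skipping special tokens, strip the end-of-sequence and padding tokens plus the leading task prompt, and then pass the result to `processor.token2json`. The sketch below covers only that clean-up step in plain Python. The token strings `</s>` and `<pad>` and the sample input are assumptions for illustration, not values confirmed for this checkpoint.

```python
import re


def clean_donut_sequence(sequence: str, eos_token: str = "</s>", pad_token: str = "<pad>") -> str:
    """Strip EOS/PAD tokens and the leading task prompt from a decoded sequence."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    # Remove only the first tag, which is the task start token (e.g. <s_gt_parse>).
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


# Hypothetical decoder output, used only to exercise the helper.
raw = "<s_gt_parse><s_name> ACME</s_name></s><pad><pad>"
cleaned = clean_donut_sequence(raw)
print(cleaned)
```

With the processor loaded as in the snippet above, `processor.token2json(cleaned)` would then convert the tagged string into a nested dict.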
Output schema
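| Group | Fields |
|---|---|
| 서류특성.* (document attributes) | 서류종류 (document type), 거래일 (transaction date), 합계금액 (total amount) |
| 피공급자.* (recipient) | 이름 (name), 거래전미지급금 (unpaid balance before the transaction), 입금액 (amount paid), 현잔액 (current balance) |
| 품목.* (line items) | 품목명 (item name), 코드 (code), 단위 (unit), 수량 (quantity), 단가 (unit price), 공급가액 (supply value), 세액 (tax amount), 수량합계 (total quantity), 공급가액합계 (total supply value), 세액합계 (total tax) |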
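Because the model can drop or garble fields, it may help to check a parsed result against the schema before using it. The following is a minimal sketch, assuming the parsed JSON nests each field under its group name and that a group may hold either one dict or a list of dicts. That nesting, the helper name `missing_fields`, and the sample values are all hypothetical; the Korean field names are written out in UTF-8.

```python
EXPECTED_FIELDS = {
    "서류특성": ["서류종류", "거래일", "합계금액"],
    "피공급자": ["이름", "거래전미지급금", "입금액", "현잔액"],
    "품목": ["품목명", "코드", "단위", "수량", "단가", "공급가액", "세액",
             "수량합계", "공급가액합계", "세액합계"],
}


def missing_fields(parsed: dict) -> list:
    """Return 'group.field' names that are absent from a parsed result."""
    missing = []
    for group, fields in EXPECTED_FIELDS.items():
        value = parsed.get(group, {})
        # Assumption: a group holds one dict or a list of dicts (e.g. several line items).
        entries = value if isinstance(value, list) else [value]
        present = set()
        for entry in entries:
            if isinstance(entry, dict):
                present.update(entry.keys())
        missing.extend(f"{group}.{name}" for name in fields if name not in present)
    return missing


# Hypothetical parsed output with an incomplete recipient group and no line items.
sample = {
    "서류특성": {"서류종류": "거래명세표", "거래일": "2024-01-05", "합계금액": "110,000"},
    "피공급자": {"이름": "샘플상사"},
}
print(missing_fields(sample))
```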
Training
- Base: naver-clova-ix/donut-base (Swin-B encoder + mBART decoder)
- Image size 720×960, task prompt <s_gt_parse>, max length 512
- AdamW, lr 5e-5, weight decay 0.01, warmup 5%, 15 epochs, fp16, gradient checkpointing
Limitations
Trained on a small in-house dataset (tens of documents). The model overfits and can collapse into repeated tokens on unseen layouts. Treat as a proof-of-concept, not production-ready. See the GitHub repo for improvement directions.
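One way to catch the repeated-token failure mode at inference time is to measure how much of the output consists of repeated n-grams and flag results above a threshold. This is a rough heuristic sketch: whitespace tokenization, the trigram window, and the 0.5 threshold are arbitrary assumptions and are not part of the released pipeline.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams in the text that occur more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def looks_collapsed(text: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag outputs dominated by repeated n-grams (threshold is an assumption)."""
    return repeated_ngram_ratio(text, n) >= threshold


# Hypothetical outputs: one varied, one stuck on a single token.
healthy = "품목명 볼트 수량 10 단가 500 공급가액 5000 세액 500"
collapsed = "볼트 " * 30
print(looks_collapsed(healthy), looks_collapsed(collapsed))
```

Flagged outputs can be routed to manual review. The `generate` options `no_repeat_ngram_size` and `repetition_penalty` are also worth experimenting with, though neither addresses the underlying overfitting.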