Model ID: sapkotapraful/FullyOCR-2-merged
Model Description
FullyOCR-2 is a multimodal vision-language model for document understanding and OCR-style extraction from images. It takes an image and an instruction prompt and generates a structured textual representation of the document content, such as Markdown.
It is optimized for:
- Document OCR
- Structured document extraction
- Markdown reconstruction
- Table extraction
- Layout-aware text generation
The model uses a chat-style multimodal interface where the user message contains an image and a textual instruction.
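Because the model emits Markdown, tables in its output can be consumed downstream with ordinary string handling. The following is a minimal sketch, assuming tables come back as GitHub-style pipe tables; the card does not specify the exact table syntax, and `parse_markdown_tables` is a hypothetical helper, not part of the model or of any library.

```python
from typing import List


def parse_markdown_tables(markdown: str) -> List[List[List[str]]]:
    """Return every pipe table in `markdown` as a list of rows of cell strings.

    Assumption: tables use GitHub-style pipe syntax with leading and trailing
    pipes. Escaped pipes inside cells are not handled.
    """
    tables: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. | --- | :---: |
            if all(cell and set(cell) <= set("-: ") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table collected so far
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Illustrative text only, not real model output
sample = (
    "# Invoice\n\n"
    "| Item | Qty |\n"
    "| --- | --- |\n"
    "| Pen | 2 |\n"
    "| Ink | 1 |\n\n"
    "Total: 3 items\n"
)
print(parse_markdown_tables(sample))
```

Each table is returned as a list of rows with the header row first, which can be handed to `csv.writer` or a DataFrame constructor if needed.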
Quick Start
Install dependencies:
pip install torch transformers pillow accelerate
Example Usage
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "sapkotapraful/FullyOCR-2-merged"

# Prefer bfloat16 on GPUs that support it, then float16; use float32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else (
    torch.float16 if device == "cuda" else torch.float32
)

# Load model (device_map="auto" needs the accelerate package)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
    device_map="auto" if device == "cuda" else None,
    trust_remote_code=True,
)
model.eval()

# Load tokenizer + processor
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load image
image = Image.open("document.png").convert("RGB")

# Instruction token requesting Markdown output
instruction = "<|MD|>"

# Chat-style message: one user turn holding an image placeholder and the instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]

# Render the chat template to a prompt string, then build the model inputs
input_text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    images=image,
    text=input_text,
    return_tensors="pt",
)

# Move tensors to the model's device
inputs = {
    k: v.to(model.device) if hasattr(v, "to") else v
    for k, v in inputs.items()
}

# Greedy decoding
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
    )

decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

# Keep only the text after the instruction marker
# (if the marker is absent from the decoded string, the whole string is kept)
extracted = decoded.split(instruction)[-1].strip()
print(extracted)
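Splitting the decoded string on the instruction marker only removes the prompt if that marker survives decoding. A common alternative is to drop the prompt token ids before decoding. The sketch below shows the idea on plain Python lists so it runs without the model; `strip_prompt_tokens` is a hypothetical helper, and the approach assumes `generate` returns the prompt ids followed by the new ids, which has not been verified for this model.

```python
from typing import List, Sequence


def strip_prompt_tokens(
    output_ids: Sequence[Sequence[int]], prompt_len: int
) -> List[List[int]]:
    """Drop the first `prompt_len` token ids from each generated sequence."""
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    return [list(seq[prompt_len:]) for seq in output_ids]


# Toy ids standing in for real tokenizer output: 3 prompt tokens, then 2 new ones
toy_output = [[101, 7, 8, 42, 43]]
new_tokens = strip_prompt_tokens(toy_output, prompt_len=3)
print(new_tokens)  # [[42, 43]]
```

With the objects from the example above, the equivalent tensor slice would be `output_ids[:, inputs["input_ids"].shape[1]:]`, passed to `tokenizer.batch_decode(..., skip_special_tokens=True)` in place of the string split.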
Citation
If you use this model in research or applications, please cite it as:
@model{fullyocr2,
  title = {FullyOCR-2: Multimodal Document OCR Model},
  author = {Praful Sapkota},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/sapkotapraful/FullyOCR-2-merged}
}
Author
Praful Sapkota
Hugging Face: https://huggingface.co/sapkotapraful