Model ID: `sapkotapraful/FullyOCR-2-merged`

Model Description

FullyOCR-2 is a multimodal vision-language model for document understanding and OCR-style extraction from images. It accepts an image and an instruction prompt and generates a structured textual representation of the document content, such as Markdown.

It is optimized for:

Document OCR
Structured document extraction
Markdown reconstruction
Table extraction
Layout-aware text generation

The model uses a chat-style multimodal interface where the user message contains an image and a textual instruction.
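
As a rough sketch of that message layout, a single-turn request can be built as shown here. `build_messages` is a hypothetical helper name, not part of the model's API. The `{"type": "image"}` entry is only a placeholder: the image itself is passed to the processor separately.

```python
from typing import Any, Dict, List


def build_messages(instruction: str) -> List[Dict[str, Any]]:
    """Build a single user turn: an image placeholder plus a text instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # placeholder; the pixels go to the processor
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = build_messages("<|MD|>")
print(messages)
```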

Quick Start

Install dependencies (`accelerate` is required by the `device_map="auto"` option used in the example):

pip install torch transformers accelerate pillow

Example Usage

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "sapkotapraful/FullyOCR-2-merged"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else (
    torch.float16 if device == "cuda" else torch.float32
)

# Load model
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
    device_map="auto" if device == "cuda" else None,
    trust_remote_code=True,
)
model.eval()

# Load tokenizer + processor
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load image
image = Image.open("document.png").convert("RGB")

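# Instruction prompt sent alongside the image
instruction = "<|MD|>"

# Single user turn: an image placeholder followed by the text instruction
messages = [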
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]

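# Render the chat template to a prompt string; tokenization happens in the processor call below
input_text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the prompt and preprocess the image into model inputs
inputs = processor(
    images=image,
    text=input_text,
    return_tensors="pt",
)

# Move tensors to the model's device, leaving non-tensor values untouched
inputs = {
    k: v.to(model.device) if hasattr(v, "to") else v
    for k, v in inputs.items()
}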

# Greedy decoding: no sampling, single beam
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode the full sequence, then keep only the text that follows the instruction
decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
extracted = decoded.split(instruction)[-1].strip()

print(extracted)
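
Because the model targets table extraction and Markdown output, the extracted text can be post-processed into rows and cells. The following is a minimal sketch, assuming the output uses GitHub-style pipe tables. That format is an assumption and is not confirmed by this model card. `parse_markdown_tables` and the `sample` text are hypothetical and for illustration only.

```python
from typing import List


def parse_markdown_tables(markdown: str) -> List[List[List[str]]]:
    """Return every pipe table in `markdown` as a list of rows of cell strings."""
    tables: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. | --- | :---: |
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line closes the table collected so far
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Hypothetical model output, used here in place of `extracted`
sample = """# Invoice

| Item | Qty |
| --- | --- |
| Pen | 2 |
| Ink | 1 |

Total: 3
"""
tables = parse_markdown_tables(sample)
print(tables)
```

In practice the `extracted` string from the example above would be passed in place of `sample`. The sketch does not handle escaped pipes inside cells or tables written without leading and trailing pipes.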

Citation

If you use this model in research or applications, please cite it as:

@misc{fullyocr2,
  title = {FullyOCR-2: Multimodal Document OCR Model},
  author = {Praful Sapkota},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/sapkotapraful/FullyOCR-2-merged}
}

Author

Praful Sapkota

Hugging Face: https://huggingface.co/sapkotapraful


Safetensors

Model size: 0.9B params

Tensor type: F16
