Baseer OCR V1.0 - Arabic Document OCR Model
Baseer OCR V1.0 is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct specifically designed for Optical Character Recognition (OCR) on complex Arabic legal documents.
Model Overview
| Attribute | Value |
|---|---|
| Base Model | Qwen2-VL-2B-Instruct |
| Fine-tuning Method | LoRA (Rank: 48) |
| Training Framework | LlamaFactory |
| Model Type | Vision-Language Model (VLM) |
| License | Apache 2.0 |
| Release Date | March 2025 |
Model Description
Baseer OCR (بصير, Arabic for "seeing" or "perceptive") is specialized for extracting text from Arabic legal documents with complex layouts. The model builds on Qwen2-VL's vision-language capabilities and is fine-tuned to handle:
- Arabic Legal Documents - Contracts, court documents, official papers
- Complex Arabic Typography - Various fonts, sizes, and styles
- Multi-column Layouts - Documents with complex structuring
- Mixed Content - Documents with tables, stamps, and annotations
- Handwritten & Printed Text - Both typed and handwritten Arabic
Key Features
- JSON Structured Output - Returns extracted text in structured JSON format
- Arabic Language Support - Optimized for Arabic script recognition
- Vision-Language Understanding - Combines visual understanding with language generation
- Instruction Following - Responds to detailed extraction prompts
Quick Start
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
# Load the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AbdoTarek/Baseer-OCR-V1.0",
    torch_dtype="auto",
    device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained("AbdoTarek/Baseer-OCR-V1.0")
# Prepare the image
image_path = "path_to_your_arabic_document.jpg"
image = Image.open(image_path)
# Create the prompt
prompt = """Extract ALL visible text from the document image.
Return the result strictly as JSON with this structure:
{
"subject": "",
"keywords": [],
"full_text": ""
}
Rules:
- Do not repeat lines.
- Preserve original order of text.
- Do not add explanations."""
# Prepare messages
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }
]
# Process and generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
image_inputs, _ = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt tokens
output_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(output_text)
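The Quick Start prints the reply as a plain string. Since the prompt asks for JSON, downstream code will usually want a dictionary. The helper below is a minimal sketch of one way to parse it, not part of the model's API: `parse_ocr_json` is a hypothetical name, and the handling of code fences or surrounding text is an assumption about how a reply might deviate from strict JSON.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker a reply might be wrapped in


def parse_ocr_json(output_text: str) -> dict:
    """Parse the model reply into a dict, tolerating fences or stray text."""
    text = output_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the reply
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        result = json.loads(text[start:end + 1])
    if not isinstance(result, dict):
        raise ValueError("model output is valid JSON but not an object")
    return result
```

With the Quick Start variables in scope, `parse_ocr_json(output_text)["full_text"]` would give the extracted text. If generation stops at `max_new_tokens` before the closing brace, parsing raises `ValueError`, which is a useful signal to raise the token limit.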
Training Details
Dataset
- Training Data: Custom dataset of Arabic legal documents
- Data Format: ShareGPT format with image paths and OCR annotations
- Preprocessing: Images resized to 512x512 pixels
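Because training images were resized to 512x512, it may help to feed the model images at the same resolution. The sketch below shows one way to do that with Pillow. Whether matching the training size improves results at inference is an assumption that has not been verified here, and `resize_for_ocr` is a hypothetical helper name.

```python
from PIL import Image

TRAIN_SIZE = (512, 512)  # resolution stated in the Preprocessing note


def resize_for_ocr(image: Image.Image, size: tuple = TRAIN_SIZE) -> Image.Image:
    """Convert to RGB and resize to the training resolution.

    Note: this stretches the page to a square, so aspect ratio is not preserved.
    """
    return image.convert("RGB").resize(size, Image.BICUBIC)


# Example (assumes the file exists):
# image = resize_for_ocr(Image.open("path_to_your_arabic_document.jpg"))
```

For dense pages with small print, downscaling to 512x512 can make characters hard to read, so comparing results with and without resizing on a few sample documents is worthwhile.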
Training Configuration
| Parameter | Value |
|---|---|
| Stage | SFT (Supervised Fine-tuning) |
| Fine-tuning Type | LoRA |
| LoRA Rank | 48 |
| LoRA Dropout | 0.05 |
| LoRA Target | all |
| Learning Rate | 1e-4 |
| Epochs | 8 |
| Batch Size | 1 |
| Gradient Accumulation | 32 |
| Warmup Ratio | 0.1 |
| Scheduler | Cosine |
| Precision | BF16 |
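With a batch size of 1 and 32 gradient accumulation steps, each optimizer update sees 32 samples per device. The short calculation below illustrates that arithmetic and gives a rough step estimate. The number of devices and the dataset size are not stated in this card, so both are placeholders, and actual trainers may round step counts differently.

```python
import math

# Values from the Training Configuration table
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 32
EPOCHS = 8
WARMUP_RATIO = 0.1


def effective_batch_size(num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * num_devices


def estimate_steps(num_samples: int, num_devices: int = 1) -> tuple:
    """Rough (total_steps, warmup_steps) estimate for a given dataset size."""
    steps_per_epoch = math.ceil(num_samples / effective_batch_size(num_devices))
    total = steps_per_epoch * EPOCHS
    return total, math.ceil(total * WARMUP_RATIO)


print(effective_batch_size())  # 32 on a single device (assumed device count)
print(estimate_steps(1000))    # hypothetical 1,000-sample dataset
```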
Hardware Requirements
- GPU: 16GB+ VRAM recommended; the model has been tested with about 8GB
- RAM: 16GB+ system RAM
- Storage: 5GB+ for model files
Usage Examples
Example 1: Extracting Text from Legal Contract
# Input: Image of Arabic legal contract
# Output: Structured JSON with subject, keywords, and full text
Example 2: Batch Processing
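# Process multiple document images in a directory
import os
from glob import glob

image_files = sorted(glob("documents/*.jpg"))
for img_path in image_files:
    # Run the Quick Start steps on each image: build messages, generate, decode
    print(f"Processing {os.path.basename(img_path)}")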
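A fuller batch loop might save results as it goes and keep running when a single page fails. The sketch below is one possible structure, under stated assumptions: `run_batch` and `IMAGE_SUFFIXES` are hypothetical names, and `ocr_fn` stands for any function you write that wraps the Quick Start steps and returns the model's reply for one image path.

```python
import json
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set; extend as needed


def run_batch(image_dir: str, ocr_fn: Callable[[str], str], out_path: str) -> int:
    """Run ocr_fn on every image in image_dir; write one JSON line per file."""
    paths = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            try:
                record = {"file": path.name, "output": ocr_fn(str(path))}
            except Exception as exc:
                # Record the failure and continue with the next page
                record = {"file": path.name, "error": str(exc)}
            # ensure_ascii=False keeps Arabic text readable in the output file
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(paths)
```

Passing the OCR step in as a function keeps model loading outside the loop, so the weights are loaded once and reused for every image.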
Model Performance
The model is optimized for Arabic legal document OCR and provides:
- High accuracy on printed Arabic text
- Structured JSON output for easy integration
- Keyword extraction for document classification
- Subject identification for document categorization
Prompt Templates
The model responds well to detailed prompts. Recommended prompt structure:
Extract ALL visible text from the document image.
Return the result strictly as JSON with this structure:
{
"subject": "",
"keywords": [],
"full_text": ""
}
Rules:
- Do not repeat lines.
- Preserve original order of text.
- Do not add explanations.
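Since the prompt fixes a three-field structure and asks the model not to repeat lines, a parsed reply can be checked against both before it is stored. The function below is an illustrative sketch: `validate_ocr_result` is a hypothetical helper, and treating consecutive identical lines as a violation is an assumption about how the "do not repeat lines" rule should be interpreted.

```python
from typing import List

EXPECTED_FIELDS = {"subject": str, "keywords": list, "full_text": str}


def validate_ocr_result(result: dict) -> List[str]:
    """Return a list of problems; an empty list means the result looks valid."""
    problems = []
    for key, expected_type in EXPECTED_FIELDS.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    extra = sorted(set(result) - set(EXPECTED_FIELDS))
    if extra:
        problems.append(f"unexpected keys: {extra}")
    # Flag consecutive duplicate lines, a common sign of looping generation
    full_text = result.get("full_text")
    if isinstance(full_text, str):
        lines = [ln.strip() for ln in full_text.splitlines() if ln.strip()]
        if any(a == b for a, b in zip(lines, lines[1:])):
            problems.append("full_text contains repeated consecutive lines")
    return problems
```

An empty return value means the reply matches the requested structure. Legal documents can legitimately repeat a line, so the duplicate check is best treated as a warning rather than a hard failure.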
Related Models
- Qwen/Qwen2-VL-2B-Instruct - Base model
- AbdoTarek/ocr-models-Qwen2-VL-2B-Instruct-V3.0 - LoRA checkpoints
License
This model is released under the Apache 2.0 license.
Acknowledgments
- Qwen Team for the base Qwen2-VL model
- LlamaFactory for the fine-tuning framework
- Hugging Face for the model hub infrastructure
Contact
- Author: Abdelrhman Tarek
- HF ID: AbdoTarek
- Model Page: Baseer-OCR-V1.0
Made with ❤️ for Arabic OCR