Sinhala LightOnOCR-2-1B QLoRA Model 🇱🇰
Fine-tuned LightOnOCR-2-1B model for high-accuracy Sinhala language OCR on historical legal documents
Model Description
This model is a QLoRA fine-tuned version of LightOnOCR-2-1B, optimized for Sinhala (සිංහල) OCR on historical and contemporary legal documents. It achieves 98.95% character accuracy on a 202-sample test set of Sri Lankan legal texts dated 1981-2019, a span of nearly four decades.
Key Features
- High Accuracy: 98.95% character accuracy on Sinhala legal documents
- Historical Coverage: Evaluated on documents from 1981-2019
- Efficient: QLoRA fine-tuning with 4-bit quantization (~3.67% trainable parameters)
- Optimized: Trained on NVIDIA RTX 4080 SUPER
- Low Resource: Runs on consumer GPUs with 4-bit quantization
- Flexible Loading: Supports both QLoRA (4-bit) and standard LoRA (full-precision) inference
Model Details
| Property | Value |
|---|---|
| Base Model | lightonai/LightOnOCR-2-1B |
| Model Type | Vision-Language Model (VLM) |
| Fine-tuning Method | QLoRA (4-bit NF4 quantization + LoRA) |
| Language | Sinhala (සිංහල) |
| License | Apache 2.0 |
| Total Parameters | ~1.04B (base) |
| Trainable Parameters | 38.27M (3.67%) |
| Precision | 4-bit quantized (NF4) |
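The trainable-parameter share in the table can be sanity-checked with simple arithmetic. The sketch below assumes the percentage is trainable parameters divided by the total parameter count (the convention PEFT's `print_trainable_parameters` reports); the function names are illustrative and not part of the model card.

```python
def trainable_share(trainable: float, total: float) -> float:
    """Return trainable parameters as a percentage of all parameters."""
    return 100.0 * trainable / total


def implied_total(trainable: float, share_percent: float) -> float:
    """Invert the percentage to recover the total it implies."""
    return trainable / (share_percent / 100.0)


TRAINABLE = 38.27e6   # from the table above
SHARE = 3.67          # percent, from the table above

total = implied_total(TRAINABLE, SHARE)
print(f"Implied total parameters: {total / 1e9:.2f}B")
```

The implied total rounds to the ~1.04B listed in the table, so the two figures are mutually consistent.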
Performance Metrics
Overall Performance (202 Test Samples)
| Metric | Score | Description |
|---|---|---|
| Character Accuracy | 98.95% | Percentage of correctly recognized characters |
| CER (Character Error Rate) | 0.0105 | Lower is better (0 = perfect) |
| WER (Word Error Rate) | 0.0563 | Word-level error rate |
| BLEU Score | 0.9808 | Text similarity score (0-1) |
| ANLS | 0.9895 | Average Normalized Levenshtein Similarity |
| METEOR | 0.9492 | Semantic similarity score |
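The character and word error rates above are conventionally derived from Levenshtein edit distance, with character accuracy equal to 1 - CER (here 1 - 0.0105 = 0.9895). The evaluation script is not shown in this card, so the following is a minimal sketch of the standard definitions, not the author's exact implementation. It counts Unicode code points, which is an assumption: Sinhala vowel signs are separate code points, and a grapheme-based count would give different numbers.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length (in code points)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: the same computation over whitespace-separated tokens."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return levenshtein(ref_words, hyp_words) / len(ref_words)


print(cer("kitten", "sitting"))          # 3 edits over 6 reference characters
print(wer("the cat sat", "the cat sit")) # 1 wrong word out of 3
```

Because a single wrong character makes its whole word count as an error, WER is typically several times higher than CER, which matches the 0.0563 versus 0.0105 reported above.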
Summary Statistics
| Statistic | Value |
|---|---|
| Median Accuracy | 99.42% |
| Std Dev Accuracy | 1.34% |
| Samples ≥ 90% accuracy | 201/202 (99.5%) |
| Samples ≥ 80% accuracy | 202/202 (100%) |
| Samples < 50% accuracy | 0/202 (0%) |
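Summary statistics of this kind can be computed from a list of per-sample character accuracies. The sketch below uses made-up values purely for illustration (the real per-sample scores are not included in this card) and assumes the sample standard deviation; whether the reported figure is a sample or population standard deviation is not stated.

```python
import statistics
from typing import Dict, List


def summarize(accuracies: List[float]) -> Dict[str, float]:
    """Summarize per-sample accuracies given in percent."""
    return {
        "median": statistics.median(accuracies),
        "std_dev": statistics.stdev(accuracies),  # sample std dev (assumption)
        "at_least_90": sum(a >= 90.0 for a in accuracies),
        "at_least_80": sum(a >= 80.0 for a in accuracies),
        "below_50": sum(a < 50.0 for a in accuracies),
    }


# Hypothetical per-sample accuracies, not the model's actual results.
example = [99.4, 98.7, 100.0, 89.5, 99.9]
print(summarize(example))
```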
Quick Start
Installation
pip install transformers==5.0.0 peft bitsandbytes Pillow
Option 1: QLoRA Inference (4-bit Quantized, Recommended for Low VRAM)
Load the base model with 4-bit quantization and apply the LoRA adapter on top. This matches the original training setup and requires ~2-3 GB VRAM.
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image

# Configuration
BASE_MODEL_ID = "lightonai/LightOnOCR-2-1B"
ADAPTER_ID = "avishadilhara/sinhala-lightonocr-2-1b-Qlora"
LONGEST_EDGE = 1540

# Load processor
processor = LightOnOcrProcessor.from_pretrained(ADAPTER_ID)
processor.tokenizer.padding_side = "left"

# Load base model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

# Load QLoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.eval()

# Run inference
image = Image.open("your_image.png").convert("RGB")
messages = [
    {"role": "user", "content": [{"type": "image"}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=text,
    images=[image],
    return_tensors="pt",
    size={"longest_edge": LONGEST_EDGE},
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=False,
    )

result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)
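For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded string may begin with residue from the chat template, depending on which template tokens are marked as special. A common remedy is to drop the first `prompt_len` tokens of each sequence before decoding. The helper below is a hypothetical illustration of that slicing on plain Python lists; `strip_prompt` is not part of Transformers.

```python
from typing import List, Sequence


def strip_prompt(sequences: Sequence[Sequence[int]], prompt_len: int) -> List[List[int]]:
    """Keep only the tokens generated after the prompt in each sequence."""
    return [list(seq[prompt_len:]) for seq in sequences]


# Toy example: three prompt tokens followed by two generated tokens.
full_output = [[101, 102, 103, 7, 8]]
print(strip_prompt(full_output, prompt_len=3))
```

With the tensors from the snippet above, the equivalent slice would be `generated_ids[:, inputs["input_ids"].shape[1]:]`, applied before calling `processor.batch_decode`.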
Option 2: LoRA Inference (Full Precision, Higher Quality)
Load the base model in bf16 without quantization and apply the LoRA adapter. Skipping quantization may give slightly better output quality but requires ~4-5 GB VRAM.
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from peft import PeftModel
from PIL import Image

# Configuration
BASE_MODEL_ID = "lightonai/LightOnOCR-2-1B"
ADAPTER_ID = "avishadilhara/sinhala-lightonocr-2-1b-Qlora"
LONGEST_EDGE = 1540

# Load processor
processor = LightOnOcrProcessor.from_pretrained(ADAPTER_ID)
processor.tokenizer.padding_side = "left"

# Load base model in full precision (no quantization)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load LoRA adapter (same weights, no quantization on base)
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.eval()

# Run inference
image = Image.open("your_image.png").convert("RGB")
messages = [
    {"role": "user", "content": [{"type": "image"}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=text,
    images=[image],
    return_tensors="pt",
    size={"longest_edge": LONGEST_EDGE},
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=False,
    )

result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)
Note: Both options use the same LoRA adapter weights. The difference is whether the base model is quantized (QLoRA) or loaded in full precision (LoRA). QLoRA uses less VRAM; LoRA may give slightly better quality.
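A rough back-of-the-envelope estimate shows why the two options differ in memory. The sketch below counts weight storage only, using the ~1.04B parameter figure from this card and decimal gigabytes. It is a lower bound under simplifying assumptions: it treats every parameter as quantized (in practice some layers usually stay in higher precision), and it ignores quantization constants, activations, the KV cache and framework overhead, which is why the VRAM figures quoted above are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.04e9  # approximate base model size from this card

print(f"4-bit NF4 weights: ~{weight_memory_gb(NUM_PARAMS, 4):.2f} GB")
print(f"bf16 weights:      ~{weight_memory_gb(NUM_PARAMS, 16):.2f} GB")
```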
Training Details
Dataset
| Split | Samples |
|---|---|
| Train | 707 |
| Validation | 101 |
| Test | 202 |
| Total | 1010 |
Dataset: avishadilhara/sinhala-ocr-lk-acts-1010
QLoRA Configuration
| Parameter | Value |
|---|---|
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task Type | CAUSAL_LM |
| Quantization | 4-bit NF4 with double quantization |
| Compute dtype | bfloat16 |
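To make the rank and alpha values concrete: LoRA adds two low-rank matrices to each targeted linear layer and scales their product by alpha / r. The sketch below mirrors the table in a plain dataclass (in PEFT these settings correspond to the `LoraConfig` fields `r`, `lora_alpha`, `lora_dropout`, `target_modules` and `task_type`). `LoraSettings` and `lora_params` are illustrative names, and the 1024 x 1024 layer size is a made-up example, not the actual dimension of LightOnOCR-2-1B.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """Plain-Python mirror of the QLoRA configuration table."""
    r: int = 32
    alpha: int = 64
    dropout: float = 0.1
    target_modules: Tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self) -> float:
        """Factor applied to the low-rank update (standard LoRA: alpha / r)."""
        return self.alpha / self.r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


cfg = LoraSettings()
print(cfg.scaling)
print(lora_params(1024, 1024, cfg.r))  # hypothetical 1024 x 1024 projection
```

With r=32 and alpha=64 the update is scaled by 2, and each adapted layer gains a number of parameters that grows linearly with the rank.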
Training Arguments
| Parameter | Value |
|---|---|
| Max Epochs | 20 (early stopping triggered after epoch 4) |
| Batch Size | 4 |
| Learning Rate | 2e-4 (linear schedule) |
| Warmup Steps | 10 |
| Weight Decay | 0.001 |
| Max Grad Norm | 1.0 |
| Optimizer | AdamW (fused) |
| Precision | bf16 |
| Early Stopping | patience=1 |
| Image Size | longest_edge=1540 |
| Max Length | 4096 tokens |
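The learning-rate row can be read as a linear warmup followed by a linear decay. The sketch below reproduces that shape under several assumptions that the table does not confirm: no gradient accumulation, a single device, the last partial batch kept, and a schedule planned over all 20 epochs. The function names are illustrative.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, keeping the final partial batch."""
    return math.ceil(num_samples / batch_size)


def linear_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


per_epoch = steps_per_epoch(707, 4)   # 707 training samples, batch size 4
total = per_epoch * 20                # planned over the 20-epoch maximum
for step in (0, 5, 10, per_epoch * 4, total):
    print(step, linear_lr(step, 2e-4, 10, total))
```

Under these assumptions, training stopped while the learning rate was still at roughly 80% of its peak, because early stopping ended the run long before the planned schedule finished.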
Training Loss
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0336 | 0.0341 |
| 2 | 0.0284 | 0.0277 |
| 3 | 0.0205 | 0.0234 |
| 4 | 0.0139 | 0.0248 |
Best model selected at epoch 3 (lowest validation loss).
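The stopping point follows directly from the validation losses and a patience of 1. The following minimal sketch replays the logic on the values from the table; it assumes any strict decrease counts as an improvement (no minimum-improvement threshold), and `early_stopping` is an illustrative helper rather than the training code used for this model.

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), with epochs numbered from 1."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


val_losses = [0.0341, 0.0277, 0.0234, 0.0248]  # from the table above
print(early_stopping(val_losses, patience=1))
```

Validation loss rises at epoch 4 while training loss keeps falling, so training stops there and the epoch 3 checkpoint is kept.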
Hardware
- GPU: NVIDIA RTX 4080 SUPER
- Training Time: ~3 hours (4 epochs)
Citation
If you use this model, please cite:
@misc{sinhala-lightonocr-2-1b-Qlora,
author = {Avisha Dilhara},
title = {Sinhala LightOnOCR-2-1B QLoRA: Fine-tuned OCR for Sinhala Legal Documents},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/avishadilhara/sinhala-lightonocr-2-1b-Qlora}
}