TR-DocVQA PaliGemma-3B LoRA
This repository contains a LoRA adapter fine-tuned for Turkish document visual question answering on TR-DocVQA-Synth.
The adapter is trained on top of google/paligemma-3b-pt-224.
It is intended for OCR-free document question answering from Turkish document images such as invoices, contracts, and offers.
Evaluation
Evaluation was run on the TR-DocVQA-Synth test split with 2,000 examples.
| Model | Setting | Test Samples | Normalized EM | ANLS | Token F1 | Empty Prediction Rate | Invalid Prediction Rate |
|---|---|---|---|---|---|---|---|
| PaliGemma-3B LoRA | Fine-tuned LoRA | 2000 | 0.7205 | 0.8745 | 0.7294 | 0.0000 | 0.0000 |
Additional paper-ready metrics, per-field breakdowns, error analysis, and LaTeX tables are included under evaluation/.
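The card does not spell out how the metrics in the table are computed. As a rough guide, here is a minimal sketch of the usual definitions: normalized exact match, ANLS as used by the DocVQA benchmark (threshold τ = 0.5, best score over gold answers), and SQuAD-style token F1. The empty prediction rate is taken here to mean the fraction of blank outputs. The `normalize` function (NFKC, Turkish-aware lowercasing, ASCII punctuation removal, whitespace collapsing) and all function names are assumptions, not the authors' evaluation code, so scores from this sketch may not exactly match the reported numbers. The Invalid Prediction Rate is not reproduced because its definition is not given.

```python
import string
import unicodedata
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> str:
    """Assumed normalization (not the authors' exact recipe): NFKC,
    Turkish-aware lowercasing, ASCII punctuation removal, whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text)
    # str.lower() maps "I" -> "i", which is wrong for Turkish, where
    # "I" -> "ı" and "İ" -> "i"; handle both explicitly before lowercasing.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def anls(pred: str, golds: List[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best (1 - NL) over gold answers, zeroed when NL >= tau."""
    p = normalize(pred)
    best = 0.0
    for gold in golds:
        g = normalize(gold)
        longest = max(len(p), len(g))
        if longest == 0:  # both empty: perfect match
            return 1.0
        nl = levenshtein(p, g) / longest  # normalized Levenshtein distance
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style bag-of-tokens F1 on normalized, whitespace-split text."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    """Average each metric over questions, taking the max over gold answers."""
    n = len(predictions)
    pairs = list(zip(predictions, references))
    return {
        "normalized_em": sum(max(exact_match(p, g) for g in golds) for p, golds in pairs) / n,
        "anls": sum(anls(p, golds) for p, golds in pairs) / n,
        "token_f1": sum(max(token_f1(p, g) for g in golds) for p, golds in pairs) / n,
        "empty_rate": sum(not p.strip() for p in predictions) / n,
    }


print(evaluate(["500 TL", ""], [["500 TL"], ["Ankara"]]))
# → {'normalized_em': 0.5, 'anls': 0.5, 'token_f1': 0.5, 'empty_rate': 0.5}
```

ANLS gives partial credit for near-miss strings (an OCR-like typo in a long answer still scores high), while normalized EM does not. That is why the ANLS column can sit well above EM for the same predictions.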
Usage
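```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel

base_model = "google/paligemma-3b-pt-224"
adapter_id = "Ethosoft/trdocvqa-paligemma-3b-lora"

# Processor (tokenizer + image preprocessing) loaded from the adapter repository.
processor = AutoProcessor.from_pretrained(adapter_id)

# Load the gated base model in bfloat16 and place it on the available device(s).
model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

image = Image.open("document.png").convert("RGB")
question = "Toplam tutar nedir?"  # "What is the total amount?"
# PaliGemma-style task prefix: "answer <lang>" followed by the question.
prompt = f"answer tr {question}\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    # Greedy decoding for deterministic short answers.
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=False, num_beams=1)

# generate() returns the prompt tokens followed by the answer; keep only the answer.
prompt_len = inputs["input_ids"].shape[-1]
answer = processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0].strip()
print(answer)
```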
Training Summary
- Method: LoRA fine-tuning
- Base model: google/paligemma-3b-pt-224
- Dataset: Ethosoft/TR-DocVQA-Synth
- Language: Turkish
- Input: document image + Turkish question
- Output: short answer text
- Hardware: TRUBA Kolyoz H200
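In standard LoRA, each adapted weight matrix W0 stays frozen and only a low-rank pair A (r × d_in) and B (d_out × r) is trained, so the adapted layer computes W0·x + (α/r)·B·A·x. The pure-Python sketch below illustrates this with made-up toy sizes, rank r = 1, and α = 2; these values are hypothetical. This adapter's actual rank, α, and target modules are stored in its PEFT adapter_config.json and are not restated on this card. The sketch also checks that folding the low-rank update into the weight gives the same output as applying it on the fly.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float, r: int) -> Vector:
    """Adapter applied on the fly: W0 x + (alpha / r) * B (A x)."""
    scale = alpha / r
    base = matvec(w0, x)
    delta = matvec(b, matvec(a, x))
    return [h + scale * d for h, d in zip(base, delta)]


def merge(w0: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the update into a single dense weight: W0 + (alpha / r) * B A."""
    scale = alpha / r
    ba = matmul(b, a)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(w0, ba)]


# Hypothetical toy layer: d_in = 3, d_out = 2, rank r = 1, alpha = 2.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, -1.0, 0.0]]                    # trainable down-projection (r x d_in)
B = [[2.0], [1.0]]                        # trainable up-projection (d_out x r)
x = [1.0, 2.0, 3.0]

print(lora_forward(W0, A, B, x, alpha=2.0, r=1))       # → [1.0, -4.0]
print(matvec(merge(W0, A, B, alpha=2.0, r=1), x))      # → [1.0, -4.0]
```

Because the update is additive, it can be merged into the base weights after training; for a loaded PeftModel, PEFT's merge_and_unload() performs this step. Only A and B (r·(d_in + d_out) values per adapted matrix, versus d_in·d_out for the full weight) need to be stored, which is why this repository can ship the adapter alone.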
Important Notes
This repository contains a LoRA adapter, not a full merged copy of the base PaliGemma model. Users must comply with the terms of the base model and accept any gated access requirements for google/paligemma-3b-pt-224.
The model was developed for research use on synthetic Turkish document VQA data. Before production use, evaluate on real documents from the target domain and review privacy, licensing, and bias considerations.