lora_gemma3_4b_lt_road_signs

LoRA adapter for Gemma 3 4B-IT fine-tuned to generate single-sentence Lithuanian captions for Lithuanian road signs.

Developed by Einstein Files as part of a university research project at the Faculty of Mathematics and Informatics, Vilnius University.

Model details

| Property | Value |
| --- | --- |
| Base model | google/gemma-3-4b-it |
| Fine-tuning method | LoRA (r=8, α=16) |
| Fine-tuned modules | Attention + MLP (language decoder only) |
| Quantization | NF4 (4-bit) |
| Training epochs | 4 |
| Learning rate | 2e-4 (cosine schedule, 3% warmup) |
| Effective batch size | 4 |
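
To make the hyperparameters above concrete, the following is a minimal sketch of what they imply, not the project's training code. It assumes the standard LoRA scaling factor α/r, a linear warmup followed by cosine decay to zero, and that the last partial batch of each epoch is kept; the function names are illustrative, and the actual step counts depend on the trainer used.

```python
import math

# Values copied from the model details and training data sections.
LORA_R = 8
LORA_ALPHA = 16
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03
EPOCHS = 4
EFFECTIVE_BATCH_SIZE = 4
TRAIN_IMAGES = 115


def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


def total_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


def lr_at(step: int, total: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup, then cosine decay to zero (assumed schedule shape)."""
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_steps(TRAIN_IMAGES, EFFECTIVE_BATCH_SIZE, EPOCHS)
print("LoRA scaling:", lora_scaling(LORA_ALPHA, LORA_R))
print("Total optimizer steps (assumed):", steps)
for s in (0, 2, 4, steps // 2, steps):
    print(f"step {s:3d}: lr = {lr_at(s, steps, PEAK_LR, WARMUP_RATIO):.2e}")
```

With only about a hundred optimizer steps under these assumptions, the 3% warmup covers just the first few steps, so almost all of training runs on the cosine decay.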

Training data

159 photographs of Lithuanian road signs taken in urban environments, each paired with a human-authored Lithuanian caption. Split: 115 train / 30 val / 14 test. Captions follow a structured convention: sign category and shape, symbolic content, and background context.
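
The card does not say how the split was produced. As an illustration only, a reproducible 115 / 30 / 14 partition could be sketched as below; the seed, the shuffling approach, the `split_dataset` helper and the file names are all assumptions, not the authors' procedure.

```python
import random


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle with a fixed seed and cut into train/val/test partitions."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 159 photographs.
image_ids = [f"sign_{i:03d}.jpg" for i in range(159)]
train, val, test = split_dataset(image_ids, n_train=115, n_val=30, n_test=14)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test images out of training across reruns, which matters when the test set has only 14 items.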

Evaluation (test set, n=14)

| Metric | Base model | This adapter | Δ |
| --- | --- | --- | --- |
| chrF++ ↑ | 22.09 | 45.31 | +23.22 |
| BERTScore-F1 ↑ | 85.71 | 92.24 | +6.53 |
| CLIPScore ↑ | 28.52 | 29.95 | +1.43 |

The base model consistently produces markdown-formatted, multi-paragraph explanations in a mix of languages, whereas the fine-tuned adapter produces concise, plain-text, single-sentence descriptions in standard Lithuanian.
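
The format difference described above can be checked mechanically. The sketch below is a rough heuristic and was not part of the reported evaluation: the `is_plain_single_sentence` helper and its regular expressions are assumptions, and the sample strings are invented for illustration rather than taken from the dataset or from model outputs.

```python
import re

# Common markdown markers: bold, headings, list bullets, inline code.
MARKDOWN_PATTERN = re.compile(r"(\*\*|__|^#{1,6}\s|^\s*[-*+]\s|`)", re.MULTILINE)
# Sentence-ending punctuation followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?…]+(?=\s|$)")


def is_plain_single_sentence(text: str) -> bool:
    """Heuristically check for one plain-text sentence without markdown."""
    stripped = text.strip()
    if not stripped or "\n" in stripped:
        return False
    if MARKDOWN_PATTERN.search(stripped):
        return False
    ends = SENTENCE_END.findall(stripped)
    return len(ends) == 1 and stripped[-1] in ".!?…"


# Invented examples, for illustration only.
single = "Įspėjamasis trikampis ženklas su pėsčiųjų simboliu, fone matyti gatvė."
markdown = "**Kelio ženklas**\n\nTai yra įspėjamasis ženklas."
two_sentences = "Tai įspėjamasis ženklas. Fone matyti gatvė."

print(is_plain_single_sentence(single))
print(is_plain_single_sentence(markdown))
print(is_plain_single_sentence(two_sentences))
```

Abbreviations that end in a period followed by a space would be counted as sentence boundaries, so the check is best treated as a quick sanity filter rather than a metric.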

Usage

from unsloth import FastVisionModel
from PIL import Image

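# Load the adapter on top of its base model, quantized to 4-bit.
model, tokenizer = FastVisionModel.from_pretrained(
    "AKrasavcev/lora_gemma3_4b_lt_road_signs",
    load_in_4bit=True,
)
# Switch the model to inference mode.
FastVisionModel.for_inference(model)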

image = Image.open("road_sign.jpg").convert("RGB")
prompt = "Aprašyk šį Lietuvos kelio ženklą vienu sakiniu lietuvių kalba."

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]
# Render the chat template to text, then process it together with the image.
inputs = tokenizer(
    image,
    tokenizer.apply_chat_template(messages, add_generation_prompt=True),
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Greedy decoding; keep only the newly generated tokens after the prompt.
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
caption = tokenizer.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
).strip()
print(caption)

Limitations

  • Trained on a small dataset (115 images); performance on rare sign types may be unreliable.
  • Vision encoder is frozen; visual grounding improvements are limited by the base model's SigLIP vision encoder.
  • Optimised for Lithuanian road sign conventions; not suitable for general image captioning.