lora_gemma3_4b_lt_road_signs

LoRA adapter for Gemma 3 4B-IT fine-tuned to generate single-sentence Lithuanian captions for Lithuanian road signs.

Developed by Einstein Files as part of a university research project at the Faculty of Mathematics and Informatics, Vilnius University.

Model details

Property	Value
Base model	`google/gemma-3-4b-it`
Fine-tuning method	LoRA (r=8, α=16)
Fine-tuned modules	Attention + MLP (language decoder only)
Quantization	NF4 (4-bit)
Training epochs	4
Learning rate	2e-4 (cosine schedule, 3% warmup)
Effective batch size	4
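
To make the LoRA settings in the table concrete, here is a minimal sketch of what r=8 and α=16 imply. It assumes the standard LoRA formulation, in which the low-rank update is scaled by α/r, rather than a rank-stabilised variant. The function names and the 2048×2048 layer shape are illustrative assumptions, not values taken from this adapter or from Gemma.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Settings from the table above: r=8, alpha=16.
print(lora_scale(16, 8))

# Hypothetical layer shape, chosen only for illustration.
print(lora_param_count(2048, 2048, 8))
```

Under this assumption the update is applied with a factor of 2, and the number of added parameters per adapted matrix grows linearly with the rank.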

Training data

159 photographs of Lithuanian road signs taken in urban environments, each paired with a human-authored Lithuanian caption. Split: 115 train / 30 val / 14 test. Captions follow a structured convention: sign category and shape, symbolic content, and background context.
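
The training-set size, together with the hyperparameters in the model details table, determines roughly how many optimizer steps the run took and how the learning rate evolved. The sketch below is an approximation under stated assumptions: the last partial batch of each epoch is kept (hence the ceiling), warmup is linear from zero, and the cosine decay ends at zero. The actual trainer may count steps slightly differently, and all function names here are hypothetical.

```python
import math


def total_optimizer_steps(n_train: int, effective_batch: int, epochs: int) -> int:
    """Approximate optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(n_train / effective_batch) * epochs


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps, rounded up."""
    return math.ceil(total_steps * warmup_ratio)


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    n_warmup = warmup_steps(total_steps, warmup_ratio)
    if step < n_warmup:
        return peak_lr * step / max(1, n_warmup)
    progress = (step - n_warmup) / max(1, total_steps - n_warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 115 training images, effective batch size 4, 4 epochs.
steps = total_optimizer_steps(115, 4, 4)
print(steps, warmup_steps(steps, 0.03))
for s in (0, 2, 4, steps // 2, steps):
    print(s, lr_at_step(s, steps))
```

With these assumptions the run is on the order of a hundred optimizer steps, so the 3% warmup covers only the first few updates.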

Evaluation (test set, n=14)

Metric	Base model	This adapter	Δ
chrF++ ↑	22.09	45.31	+23.22
BERTScore-F1 ↑	85.71	92.24	+6.53
CLIPScore ↑	28.52	29.95	+1.43
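
For readers unfamiliar with chrF++, the sketch below illustrates the character n-gram F-score at its core. It is a simplified, sentence-level illustration only: it omits the word n-grams that chrF++ adds, it is not the implementation used to produce the scores in the table, and it will not reproduce them. The function names are hypothetical; β=2 (recall weighted more than precision) and a maximum n-gram order of 6 are the usual chrF defaults.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100.0 * (1 + b2) * p * r / (b2 * p + r)


reference = "Įspėjamasis trikampis ženklas su raudonu kraštu."
print(simple_chrf(reference, reference))
print(simple_chrf("Ženklas su raudonu kraštu.", reference))
```

Because the score works on characters rather than whole words, it gives partial credit for shared stems and endings, which suits a morphologically rich language such as Lithuanian.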

The base model consistently produces markdown-formatted, multi-paragraph explanations that mix languages. The fine-tuned adapter produces concise, plain-text, single-sentence descriptions in standard Lithuanian.
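
The format difference described above can be checked automatically. The following is a rough heuristic sketch, not part of the original evaluation: it flags outputs that contain line breaks, common markdown markers, or more than one sentence. The function name and the regular expressions are assumptions, and abbreviations with internal periods would be misclassified as multiple sentences.

```python
import re

# Common markdown markers: bold, inline code, headings, bullet lists.
MARKDOWN_MARKERS = re.compile(r"(\*\*|__|`|^#{1,6}\s|^\s*[-*]\s)", re.MULTILINE)

# Sentence-ending punctuation followed by whitespace and more text.
SENTENCE_BREAK = re.compile(r"[.!?]\s+\S")


def is_plain_single_sentence(caption: str) -> bool:
    """Heuristically check that a caption is one plain-text sentence."""
    text = caption.strip()
    if not text or "\n" in text:
        return False
    if MARKDOWN_MARKERS.search(text):
        return False
    return SENTENCE_BREAK.search(text) is None


print(is_plain_single_sentence("Įspėjamasis trikampis ženklas su raudonu kraštu."))
print(is_plain_single_sentence("**Ženklas**\n\nTai yra kelio ženklas."))
```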

Usage

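from unsloth import FastVisionModel
from PIL import Image

# Load the adapter together with its base model, quantized to 4-bit.
model, tokenizer = FastVisionModel.from_pretrained(
    "AKrasavcev/lora_gemma3_4b_lt_road_signs",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

image = Image.open("road_sign.jpg").convert("RGB")
# English: "Describe this Lithuanian road sign in one sentence in Lithuanian."
prompt = "Aprašyk šį Lietuvos kelio ženklą vienu sakiniu lietuvių kalba."

# Chat-format message with an image placeholder followed by the text prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]
# The chat template already inserts special tokens, so none are added here.
inputs = tokenizer(
    image,
    tokenizer.apply_chat_template(messages, add_generation_prompt=True),
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")  # requires a CUDA-capable GPU

# Greedy decoding; keep only the newly generated tokens after the prompt.
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
caption = tokenizer.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
).strip()
print(caption)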

Limitations

Trained on a small dataset (115 images); performance on rare sign types may be unreliable.
The vision encoder is frozen, so any improvement in visual grounding is limited by the base model's SigLIP vision encoder.
Optimised for Lithuanian road sign conventions; not suitable for general image captioning.
