Model Card for RoQwen3-VL-2B-Instruct

RoQwen3-VL-2B-Instruct is a Romanian-adapted vision-language model built on top of Qwen/Qwen3-VL-2B-Instruct. It was produced by continued supervised instruction tuning of the base Qwen3-VL checkpoint on a Romanian multimodal SFT mixture covering general instruction following (LLaVA mix), captioning (Pixmo-Cap, Flickr30k-Cap), visual question answering (Pixmo-AA, Pixmo-Cap-QA, Flickr30k-QA), document and chart understanding (CoSyn, FinePDFs), and visual grounding (Pixmo-Points, Pixmo-Count). The model is intended for research on Romanian VLM capabilities.

Model Details

Model Description

Model Sources

Intended Use

Intended Use Cases

RoQwen3-VL-2B-Instruct is intended for research use on Romanian vision-language tasks — captioning, visual question answering, cultural understanding, OCR / document understanding, and visual grounding — and as a starting point for further Romanian VLM adaptation.

Out-of-Scope Use

Any use that violates applicable laws or regulations (including trade-compliance laws) or the project's license, as well as use in languages other than Romanian.

How to Get Started with the Model

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Load the model in bfloat16 and let accelerate place it on the available device(s).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "OpenLLM-Ro/RoQwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained("OpenLLM-Ro/RoQwen3-VL-2B-Instruct")

image = Image.open("example.jpg").convert("RGB")
# Romanian prompt: "Describe the image in detail."
question = "Descrie imaginea în detaliu."

# Single user turn containing one image followed by the text question.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]},
]
# Apply the chat template and tokenize in one step; floating-point inputs are cast to bfloat16.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Drop the prompt tokens so that only the newly generated answer is decoded.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
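
When the same prompt layout is reused for many image/question pairs, it can help to factor the message construction into a small helper. The sketch below mirrors the `messages` structure from the snippet above. `build_user_turn` is a hypothetical name, not part of `transformers`, and passing more than one image in a single turn is an assumption about the chat template rather than something this card documents.

```python
from typing import Any, Dict, List


def build_user_turn(images: List[Any], question: str) -> List[Dict[str, Any]]:
    """Build a single-turn chat: all images first, then the text question.

    Hypothetical helper for illustration; it only assembles plain dicts and lists.
    """
    if not question.strip():
        raise ValueError("question must be non-empty")
    content: List[Dict[str, Any]] = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# A placeholder string stands in for a PIL image so the sketch runs without any files.
# Romanian prompt: "How many people are in the image?"
messages = build_user_turn(["<image-1>"], "Câte persoane sunt în imagine?")
print(messages[0]["content"][-1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` in place of the hand-written `messages` list, with real `PIL.Image` objects instead of the placeholder.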

Benchmarks

All benchmarks below are evaluated in Romanian. Per-benchmark winners are shown in bold. Micro is the mean over individual benchmarks; Macro is the mean over capability groups.

Aggregate

| Model | Micro avg. | Macro avg. |
|---|---|---|
| Qwen3-VL-2B-Instruct | 51.51 | 51.31 |
| RoQwen3-VL-2B-Instruct | **63.36** | **62.65** |

General Understanding

| Model | MMBench | MMStar | SeedBench2 |
|---|---|---|---|
| Qwen3-VL-2B-Instruct | 62.69 | 45.92 | 63.38 |
| RoQwen3-VL-2B-Instruct | **71.90** | **50.73** | **69.29** |

Knowledge & Reasoning

| Model | MMMU | MME |
|---|---|---|
| Qwen3-VL-2B-Instruct | 38.33 | 61.59 |
| RoQwen3-VL-2B-Instruct | **40.22** | **62.19** |

Cultural

| Model | CVQA | ALM-Bench | RoMemes | RoCultVLM |
|---|---|---|---|---|
| Qwen3-VL-2B-Instruct | 57.95 | 48.72 | **46.68** | 50.31 |
| RoQwen3-VL-2B-Instruct | **61.92** | **60.97** | 36.71 | **54.00** |

Generation & Open-ended

| Model | RoFlickr30k-Caption | RoFlickr30k-QA | LLaVA-Wild | AyaVisionBench | m-WildVision |
|---|---|---|---|---|---|
| Qwen3-VL-2B-Instruct | 70.09 | 30.59 | 29.89 | 43.04 | 44.76 |
| RoQwen3-VL-2B-Instruct | **83.80** | **85.70** | **50.40** | **55.33** | **60.08** |

OCR & Documents

| Model | RoCosyn | RoFinepdfs | RoMemes OCR |
|---|---|---|---|
| Qwen3-VL-2B-Instruct | 48.63 | 78.62 | **91.04** |
| RoQwen3-VL-2B-Instruct | **64.07** | **86.85** | 89.54 |

Grounding

| Model | PixmoCount | PixmoPoints |
|---|---|---|
| Qwen3-VL-2B-Instruct | 56.36 | 10.09 |
| RoQwen3-VL-2B-Instruct | **65.28** | **54.89** |
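
The aggregate scores can be recomputed from the per-benchmark tables above. The sketch below assumes that Micro is the unweighted mean over all 19 individual benchmarks and Macro is the unweighted mean of the six per-group means, which is one reading of the definitions given at the start of this section. The dictionary layout and function names are illustrative only.

```python
from statistics import mean

# Per-benchmark scores copied from the tables above, keyed by capability group.
BASE = {  # Qwen3-VL-2B-Instruct
    "General Understanding": [62.69, 45.92, 63.38],
    "Knowledge & Reasoning": [38.33, 61.59],
    "Cultural": [57.95, 48.72, 46.68, 50.31],
    "Generation & Open-ended": [70.09, 30.59, 29.89, 43.04, 44.76],
    "OCR & Documents": [48.63, 78.62, 91.04],
    "Grounding": [56.36, 10.09],
}
TUNED = {  # RoQwen3-VL-2B-Instruct
    "General Understanding": [71.90, 50.73, 69.29],
    "Knowledge & Reasoning": [40.22, 62.19],
    "Cultural": [61.92, 60.97, 36.71, 54.00],
    "Generation & Open-ended": [83.80, 85.70, 50.40, 55.33, 60.08],
    "OCR & Documents": [64.07, 86.85, 89.54],
    "Grounding": [65.28, 54.89],
}


def micro_avg(groups):
    """Mean over every individual benchmark, ignoring group boundaries."""
    return mean(score for scores in groups.values() for score in scores)


def macro_avg(groups):
    """Mean of the per-group means, so each capability group counts equally."""
    return mean(mean(scores) for scores in groups.values())


for name, groups in [("Qwen3-VL-2B-Instruct", BASE), ("RoQwen3-VL-2B-Instruct", TUNED)]:
    print(name, round(micro_avg(groups), 2), round(macro_avg(groups), 2))
# Qwen3-VL-2B-Instruct 51.51 51.31
# RoQwen3-VL-2B-Instruct 63.36 62.65
```

Under these assumptions the recomputed values match the Aggregate table. Macro weights each capability group equally, so a two-benchmark group such as Grounding influences it as much as the five-benchmark Generation & Open-ended group.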

Citation

@misc{masala2026intelegi,
      title={``\^{I}n\c{t}elegi Rom\^{a}ne\c{s}te?'' A Recipe for Romanian Vision-Language Models},
      author={Mihai Masala and Marius Leordeanu and Mihai Dascalu and Traian Rebedea},
      year={2026},
      eprint={2605.31401},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.31401},
}

@inproceedings{masala-etal-2024-vorbesti,
    title = "``Vorbeşti Româneşte?'' A Recipe to Train Powerful {R}omanian {LLM}s with {E}nglish Instructions",
    author = "Masala, Mihai and Ilie-Ablachim, Denis and Dima, Alexandru and Corlatescu, Dragos and Zavelca, Miruna and Olaru, Ovio and Terian, Simina and Terian, Andrei and Leordeanu, Marius and Velicu, Horia and Popescu, Marius and Dascalu, Mihai and Rebedea, Traian",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    pages = "11632--11647"
}