Model Card for RoQwen2-VL-2B-Instruct

RoQwen2-VL-2B-Instruct is a Romanian-adapted vision-language model built on top of Qwen/Qwen2-VL-2B-Instruct. It was produced by continued supervised instruction tuning of the base Qwen2-VL checkpoint on a Romanian multimodal SFT mixture covering general instruction following (LLaVA mix), captioning (Pixmo-Cap, Flickr30k-Cap), visual question answering (Pixmo-AA, Pixmo-Cap-QA, Flickr30k-QA), document and chart understanding (CoSyn, FinePDFs), and visual grounding (Pixmo-Points, Pixmo-Count). The model is intended for research on Romanian VLM capabilities.

Model Details

Model Description

Model Sources

Intended Use

Intended Use Cases

RoQwen2-VL-2B-Instruct is intended for research on Romanian vision-language tasks (captioning, visual question answering, cultural understanding, OCR and document understanding, and visual grounding) and as a starting point for further Romanian VLM adaptation.

Out-of-Scope Use

Any use that violates applicable laws or regulations (including trade-compliance laws) or the project's license is out of scope, as is use in languages other than Romanian.

How to Get Started with the Model

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the model in bfloat16 and let device_map place it on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OpenLLM-Ro/RoQwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained("OpenLLM-Ro/RoQwen2-VL-2B-Instruct")

image = Image.open("example.jpg").convert("RGB")
question = "Descrie imaginea în detaliu."  # "Describe the image in detail."

# Single-turn conversation with one image followed by the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]},
]
# Apply the chat template and tokenize in one step, then move tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Greedy decoding (do_sample=False) for reproducible output.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Drop the prompt tokens so only the newly generated answer is decoded.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
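
To ask several questions about the same image, the message layout used above can be wrapped in a small helper. This is only a minimal sketch: `build_messages` is a hypothetical name (it is not part of transformers), the extra Romanian prompts are illustrative, and the placeholder stands in for the PIL image loaded in the snippet above.

```python
from typing import Any


def build_messages(image: Any, question: str) -> list[dict]:
    """Build a single-turn chat with one image and one text prompt.

    Hypothetical helper: it mirrors the `messages` layout shown above.
    """
    if not question.strip():
        raise ValueError("question must be a non-empty string")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]


# Illustrative Romanian prompts: describe, count people, read text.
questions = [
    "Descrie imaginea în detaliu.",
    "Câte persoane sunt în imagine?",
    "Ce text apare în imagine?",
]

placeholder_image = None  # stand-in; pass the PIL image loaded earlier
conversations = [build_messages(placeholder_image, q) for q in questions]
print(len(conversations))
```

Each element of `conversations` can then be passed to `processor.apply_chat_template` in place of `messages`.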

Benchmarks

All benchmarks below are evaluated in Romanian. Per-benchmark winners are shown in bold. Micro is the mean over individual benchmarks; Macro is the mean over capability groups.

Aggregate

| Model | Micro avg. | Macro avg. |
|---|---|---|
| Qwen2-VL-2B-Instruct | 41.39 | 40.56 |
| RoQwen2-VL-2B-Instruct | **59.05** | **57.88** |
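
The Micro and Macro definitions can be checked against the per-benchmark tables in the sections that follow. The sketch below copies those scores by hand and assumes every benchmark counts once in the micro average and every capability group counts once in the macro average. The names `SCORES`, `micro_avg`, and `macro_avg` are illustrative and are not taken from the authors' evaluation code.

```python
from statistics import mean

# Scores copied from the per-benchmark tables, grouped by capability.
SCORES = {
    "Qwen2-VL-2B-Instruct": {
        "General Understanding": [51.47, 37.94, 52.54],
        "Knowledge & Reasoning": [34.56, 31.22],
        "Cultural": [57.62, 37.57, 30.77, 46.24],
        "Generation & Open-ended": [64.55, 45.64, 21.17, 24.44, 29.66],
        "OCR & Documents": [41.01, 37.53, 86.34],
        "Grounding": [45.73, 10.39],
    },
    "RoQwen2-VL-2B-Instruct": {
        "General Understanding": [65.25, 40.50, 65.84],
        "Knowledge & Reasoning": [36.78, 49.10],
        "Cultural": [65.56, 60.88, 34.00, 56.09],
        "Generation & Open-ended": [83.69, 82.92, 42.49, 45.70, 55.92],
        "OCR & Documents": [57.22, 85.50, 83.74],
        "Grounding": [63.76, 47.03],
    },
}


def micro_avg(groups: dict[str, list[float]]) -> float:
    """Mean over all individual benchmarks, ignoring group boundaries."""
    return mean(score for scores in groups.values() for score in scores)


def macro_avg(groups: dict[str, list[float]]) -> float:
    """Mean of the per-group means, so each capability group weighs equally."""
    return mean(mean(scores) for scores in groups.values())


for name, groups in SCORES.items():
    print(f"{name}: micro={micro_avg(groups):.2f} macro={macro_avg(groups):.2f}")
# Qwen2-VL-2B-Instruct: micro=41.39 macro=40.56
# RoQwen2-VL-2B-Instruct: micro=59.05 macro=57.88
```

The macro average is lower than the micro average for both models because the groups with the most benchmarks (Cultural, Generation & Open-ended) do not dominate it.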

General Understanding

| Model | MMBench | MMStar | SeedBench2 |
|---|---|---|---|
| Qwen2-VL-2B-Instruct | 51.47 | 37.94 | 52.54 |
| RoQwen2-VL-2B-Instruct | **65.25** | **40.50** | **65.84** |

Knowledge & Reasoning

| Model | MMMU | MME |
|---|---|---|
| Qwen2-VL-2B-Instruct | 34.56 | 31.22 |
| RoQwen2-VL-2B-Instruct | **36.78** | **49.10** |

Cultural

| Model | CVQA | ALM-Bench | RoMemes | RoCultVLM |
|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 57.62 | 37.57 | 30.77 | 46.24 |
| RoQwen2-VL-2B-Instruct | **65.56** | **60.88** | **34.00** | **56.09** |

Generation & Open-ended

| Model | RoFlickr30k-Caption | RoFlickr30k-QA | LLaVA-Wild | AyaVisionBench | m-WildVision |
|---|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 64.55 | 45.64 | 21.17 | 24.44 | 29.66 |
| RoQwen2-VL-2B-Instruct | **83.69** | **82.92** | **42.49** | **45.70** | **55.92** |

OCR & Documents

| Model | RoCosyn | RoFinepdfs | RoMemes OCR |
|---|---|---|---|
| Qwen2-VL-2B-Instruct | 41.01 | 37.53 | **86.34** |
| RoQwen2-VL-2B-Instruct | **57.22** | **85.50** | 83.74 |

Grounding

| Model | PixmoCount | PixmoPoints |
|---|---|---|
| Qwen2-VL-2B-Instruct | 45.73 | 10.39 |
| RoQwen2-VL-2B-Instruct | **63.76** | **47.03** |

Citation

@misc{masala2026intelegi,
      title={``\^{I}n\c{t}elegi Rom\^{a}ne\c{s}te?'' A Recipe for Romanian Vision-Language Models},
      author={Mihai Masala and Marius Leordeanu and Mihai Dascalu and Traian Rebedea},
      year={2026},
      eprint={2605.31401},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.31401},
}

@inproceedings{masala-etal-2024-vorbesti,
    title = "``Vorbeşti Româneşte?'' A Recipe to Train Powerful {R}omanian {LLM}s with {E}nglish Instructions",
    author = "Masala, Mihai and Ilie-Ablachim, Denis and Dima, Alexandru and Corlatescu, Dragos and Zavelca, Miruna and Olaru, Ovio and Terian, Simina and Terian, Andrei and Leordeanu, Marius and Velicu, Horia and Popescu, Marius and Dascalu, Mihai and Rebedea, Traian",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    pages = "11632--11647"
}