Instructions to use OpenLLM-Ro/RoQwen2-VL-2B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenLLM-Ro/RoQwen2-VL-2B-Instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenLLM-Ro/RoQwen2-VL-2B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("OpenLLM-Ro/RoQwen2-VL-2B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("OpenLLM-Ro/RoQwen2-VL-2B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use OpenLLM-Ro/RoQwen2-VL-2B-Instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "OpenLLM-Ro/RoQwen2-VL-2B-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "OpenLLM-Ro/RoQwen2-VL-2B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/OpenLLM-Ro/RoQwen2-VL-2B-Instruct
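The curl request to the vLLM server above passes the image as a remote URL. As a minimal sketch, the same request body can be built in Python; it also shows one way to send a local image, assuming the server accepts base64 `data:` URLs in the `image_url` field (an assumption about the OpenAI-compatible endpoint, not something this card states). The helper names `bytes_to_data_url` and `build_chat_payload` are hypothetical.

```python
import base64
import json

MODEL_ID = "OpenLLM-Ro/RoQwen2-VL-2B-Instruct"


def bytes_to_data_url(data: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_chat_payload(image_url: str, question: str) -> dict:
    """Build the same JSON body as the curl example: one text part, one image part."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ]
            }
        ],
    }


# Remote image, as in the curl example above.
payload = build_chat_payload(
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    "Descrie această imagine într-o propoziție.",
)
print(json.dumps(payload, ensure_ascii=False, indent=2))

# Local image (assumes data URLs are accepted by the server):
#   from pathlib import Path
#   url = bytes_to_data_url(Path("example.jpg").read_bytes(), "image/jpeg")
#   payload = build_chat_payload(url, "Descrie această imagine într-o propoziție.")
```

The printed JSON can be posted to `http://localhost:8000/v1/chat/completions` with any HTTP client, in the same way as the curl call.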
- SGLang
How to use OpenLLM-Ro/RoQwen2-VL-2B-Instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenLLM-Ro/RoQwen2-VL-2B-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "OpenLLM-Ro/RoQwen2-VL-2B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenLLM-Ro/RoQwen2-VL-2B-Instruct" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "OpenLLM-Ro/RoQwen2-VL-2B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
- Docker Model Runner
How to use OpenLLM-Ro/RoQwen2-VL-2B-Instruct with Docker Model Runner:
docker model run hf.co/OpenLLM-Ro/RoQwen2-VL-2B-Instruct
Model Card for RoQwen2-VL-2B-Instruct
RoQwen2-VL-2B-Instruct is a Romanian-adapted vision-language model built on top of Qwen/Qwen2-VL-2B-Instruct. It was produced by continued supervised instruction tuning of the base Qwen2-VL checkpoint on a Romanian multimodal SFT mixture covering general instruction following (LLaVA mix), captioning (Pixmo-Cap, Flickr30k-Cap), visual question answering (Pixmo-AA, Pixmo-Cap-QA, Flickr30k-QA), document and chart understanding (CoSyn, FinePDFs), and visual grounding (Pixmo-Points, Pixmo-Count). The model is intended for research on Romanian VLM capabilities.
Model Details
Model Description
- Developed by: OpenLLM-Ro
- Language(s): Romanian
- License: cc-by-nc-4.0
- Finetuned from model: Qwen/Qwen2-VL-2B-Instruct
- Trained using:
- OpenLLM-Ro/ro_sft_laion
- OpenLLM-Ro/ro_sft_pixmo_cap
- OpenLLM-Ro/ro_sft_flickr30k_cap
- OpenLLM-Ro/ro_sft_llava_mix
- OpenLLM-Ro/ro_sft_pixmo_aa
- OpenLLM-Ro/ro_sft_pixmo_cap_qa
- OpenLLM-Ro/ro_sft_flickr30k_qa
- OpenLLM-Ro/ro_sft_cosyn
- OpenLLM-Ro/ro_sft_finepdfs
- OpenLLM-Ro/ro_sft_pixmo_points
- OpenLLM-Ro/ro_sft_pixmo_count
Model Sources
- Repository: https://github.com/OpenLLM-Ro/LLaMA-Factory
- Paper: https://arxiv.org/abs/2605.31401
Intended Use
Intended Use Cases
RoQwen2-VL-2B-Instruct is intended for research use on Romanian vision-language tasks — captioning, visual question answering, cultural understanding, OCR / document understanding, and visual grounding — and as a starting point for further Romanian VLM adaptation.
Out-of-Scope Use
Any use that violates applicable laws or regulations (including trade-compliance laws) or the project's license, as well as use in languages other than Romanian.
How to Get Started with the Model
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the model in bfloat16; device_map="auto" requires the accelerate package.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OpenLLM-Ro/RoQwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained("OpenLLM-Ro/RoQwen2-VL-2B-Instruct")

# Load a local image and ask a question in Romanian ("Describe the image in detail.").
image = Image.open("example.jpg").convert("RGB")
question = "Descrie imaginea în detaliu."

# Single-turn chat message: the image first, then the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]},
]

# Apply the chat template, tokenize, and move the inputs to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Greedy decoding; print only the newly generated tokens, not the prompt.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
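The `messages` structure in the snippet above can be wrapped in a small helper when many images or questions are processed in a loop. This is only a sketch: `build_messages` is a hypothetical name, and passing several images in one turn assumes the chat template accepts multiple `image` entries, which this card does not confirm for this checkpoint.

```python
from typing import Any, Dict, List, Sequence


def build_messages(images: Sequence[Any], question: str) -> List[Dict[str, Any]]:
    """Build a single-turn user message: all images first, then the text prompt.

    `images` may hold PIL images, as in the example above; the helper only
    arranges them and does not inspect or convert them.
    """
    if not question.strip():
        raise ValueError("question must not be empty")
    content: List[Dict[str, Any]] = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Placeholder object standing in for a PIL image, so the sketch runs without files.
placeholder_image = object()
messages = build_messages([placeholder_image], "Descrie imaginea în detaliu.")
print([part["type"] for part in messages[0]["content"]])
```

The returned list can be passed to `processor.apply_chat_template(...)` exactly as `messages` is in the example above.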
Benchmarks
All benchmarks below are evaluated in Romanian. Per-benchmark winners are shown in bold. Micro is the mean over individual benchmarks; Macro is the mean over capability groups.
Aggregate
| Model | Micro avg. | Macro avg. |
|---|---|---|
| Qwen2-VL-2B-Instruct | 41.39 | 40.56 |
| RoQwen2-VL-2B-Instruct | 59.05 | 57.88 |
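As a check on the definitions of Micro and Macro, the aggregate scores can be recomputed from the per-benchmark numbers reported in the tables of this section. The sketch below copies those numbers as `(Qwen2-VL-2B-Instruct, RoQwen2-VL-2B-Instruct)` pairs; the dictionary layout and the function name `micro_macro` are illustrative choices, not part of the evaluation code.

```python
from statistics import mean

# Per-benchmark scores from this card: (Qwen2-VL-2B-Instruct, RoQwen2-VL-2B-Instruct).
SCORES = {
    "General Understanding": {
        "MMBench": (51.47, 65.25), "MMStar": (37.94, 40.50), "SeedBench2": (52.54, 65.84),
    },
    "Knowledge & Reasoning": {
        "MMMU": (34.56, 36.78), "MME": (31.22, 49.10),
    },
    "Cultural": {
        "CVQA": (57.62, 65.56), "ALM-Bench": (37.57, 60.88),
        "RoMemes": (30.77, 34.00), "RoCultVLM": (46.24, 56.09),
    },
    "Generation & Open-ended": {
        "RoFlickr30k-Caption": (64.55, 83.69), "RoFlickr30k-QA": (45.64, 82.92),
        "LLaVA-Wild": (21.17, 42.49), "AyaVisionBench": (24.44, 45.70),
        "m-WildVision": (29.66, 55.92),
    },
    "OCR & Documents": {
        "RoCosyn": (41.01, 57.22), "RoFinepdfs": (37.53, 85.50), "RoMemes OCR": (86.34, 83.74),
    },
    "Grounding": {
        "PixmoCount": (45.73, 63.76), "PixmoPoints": (10.39, 47.03),
    },
}


def micro_macro(model_index: int) -> tuple:
    """Return (micro, macro) averages for the model at the given tuple index."""
    groups = [[pair[model_index] for pair in group.values()] for group in SCORES.values()]
    micro = mean(score for group in groups for score in group)  # mean over all benchmarks
    macro = mean(mean(group) for group in groups)               # mean of per-group means
    return round(micro, 2), round(macro, 2)


print("Qwen2-VL-2B-Instruct  ", micro_macro(0))  # (41.39, 40.56)
print("RoQwen2-VL-2B-Instruct", micro_macro(1))  # (59.05, 57.88)
```

Both results match the aggregate table, which confirms that Micro weights each of the 19 benchmarks equally while Macro weights each of the 6 capability groups equally.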
General Understanding
| Model | MMBench | MMStar | SeedBench2 |
|---|---|---|---|
| Qwen2-VL-2B-Instruct | 51.47 | 37.94 | 52.54 |
| RoQwen2-VL-2B-Instruct | **65.25** | **40.50** | **65.84** |
Knowledge & Reasoning
| Model | MMMU | MME |
|---|---|---|
| Qwen2-VL-2B-Instruct | 34.56 | 31.22 |
| RoQwen2-VL-2B-Instruct | **36.78** | **49.10** |
Cultural
| Model | CVQA | ALM-Bench | RoMemes | RoCultVLM |
|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 57.62 | 37.57 | 30.77 | 46.24 |
| RoQwen2-VL-2B-Instruct | **65.56** | **60.88** | **34.00** | **56.09** |
Generation & Open-ended
| Model | RoFlickr30k-Caption | RoFlickr30k-QA | LLaVA-Wild | AyaVisionBench | m-WildVision |
|---|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 64.55 | 45.64 | 21.17 | 24.44 | 29.66 |
| RoQwen2-VL-2B-Instruct | **83.69** | **82.92** | **42.49** | **45.70** | **55.92** |
OCR & Documents
| Model | RoCosyn | RoFinepdfs | RoMemes OCR |
|---|---|---|---|
| Qwen2-VL-2B-Instruct | 41.01 | 37.53 | **86.34** |
| RoQwen2-VL-2B-Instruct | **57.22** | **85.50** | 83.74 |
Grounding
| Model | PixmoCount | PixmoPoints |
|---|---|---|
| Qwen2-VL-2B-Instruct | 45.73 | 10.39 |
| RoQwen2-VL-2B-Instruct | **63.76** | **47.03** |
Citation
@misc{masala2026intelegi,
title={``\^{I}n\c{t}elegi Rom\^{a}ne\c{s}te?'' A Recipe for Romanian Vision-Language Models},
author={Mihai Masala and Marius Leordeanu and Mihai Dascalu and Traian Rebedea},
year={2026},
eprint={2605.31401},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.31401},
}
@inproceedings{masala-etal-2024-vorbesti,
title = "``Vorbeşti Româneşte?'' A Recipe to Train Powerful {R}omanian {LLM}s with {E}nglish Instructions",
author = "Masala, Mihai and Ilie-Ablachim, Denis and Dima, Alexandru and Corlatescu, Dragos and Zavelca, Miruna and Olaru, Ovio and Terian, Simina and Terian, Andrei and Leordeanu, Marius and Velicu, Horia and Popescu, Marius and Dascalu, Mihai and Rebedea, Traian",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
pages = "11632--11647"
}