Libraries

How to use VLAI-AIVN/llava-vietnamese-aio with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="VLAI-AIVN/llava-vietnamese-aio")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
# AutoModelForImageTextToText is available in the transformers 4.x range pinned in the Usage section below.
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("VLAI-AIVN/llava-vietnamese-aio")
model = AutoModelForImageTextToText.from_pretrained("VLAI-AIVN/llava-vietnamese-aio")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use VLAI-AIVN/llava-vietnamese-aio with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "VLAI-AIVN/llava-vietnamese-aio"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VLAI-AIVN/llava-vietnamese-aio",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
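
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `post_chat` are hypothetical and not part of vLLM, and the sketch assumes the server started above is listening on `http://localhost:8000`.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same model, prompt, and image as the curl example above.
payload = build_chat_payload(
    "VLAI-AIVN/llava-vietnamese-aio",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With the server running, `post_chat("http://localhost:8000", payload)` should return a response in the OpenAI chat-completions shape, where the reply text is expected under `["choices"][0]["message"]["content"]`.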

SGLang

How to use VLAI-AIVN/llava-vietnamese-aio with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VLAI-AIVN/llava-vietnamese-aio" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VLAI-AIVN/llava-vietnamese-aio",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VLAI-AIVN/llava-vietnamese-aio" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use VLAI-AIVN/llava-vietnamese-aio with Docker Model Runner:
```
docker model run hf.co/VLAI-AIVN/llava-vietnamese-aio
```

LLaVA Vietnamese AIO

LLaVA Vietnamese AIO is a Vietnamese vision-language checkpoint for image understanding, visual question answering, and instruction-style multimodal responses. It combines a SigLIP2 vision encoder, a Llama 3.2 1B instruction model, and a trained multimodal projector in a standard Hugging Face Transformers LLaVA layout.

This repository stores the deployable inference checkpoint only. Optimizer, scheduler, and RNG states were intentionally excluded because they are needed only for training resume.

Model Details

Model type: LLaVA-style image-text-to-text model
Vision encoder: google/siglip2-so400m-patch16-384
Language model: meta-llama/Llama-3.2-1B-Instruct
Multimodal bridge: MLP projector
Primary language: Vietnamese
Checkpoint stage: instruction tuning
Checkpoint step: 6000
Best validation metric: eval_loss = 1.0164677494163994
Checkpoint format: standard transformers.LlavaForConditionalGeneration checkpoint
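
As a rough illustration of what the vision encoder contributes to the sequence, the sketch below derives the patch grid from the encoder name. It assumes that `patch16-384` means 384x384 inputs split into 16x16 patches; the helper `patch_grid` is hypothetical, and the number of image placeholder tokens the processor actually inserts depends on the processor configuration in this repository.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumption: values read off the encoder name google/siglip2-so400m-patch16-384.
side, num_patches = patch_grid(image_size=384, patch_size=16)
print(f"{side} x {side} grid, {num_patches} patch embeddings per image")
```

Under these assumptions each image yields a 24 x 24 grid, so the MLP projector would map 576 patch embeddings into the language model's embedding space.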

Intended Use

This checkpoint is intended for Vietnamese multimodal experimentation and internal application prototyping, especially:

visual question answering in Vietnamese
short image-grounded instruction following
image description and scene understanding
local testing of the LLAVA project inference stack

It is not intended for safety-critical medical, legal, financial, identity, or surveillance decisions.

Usage

Install the runtime dependencies:

pip install -U "transformers>=4.55.4,<5" "accelerate>=1.12.0,<2" pillow torch

Minimal Python inference:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

repo_id = "VLAI-AIVN/llava-vietnamese-aio"

processor = AutoProcessor.from_pretrained(repo_id, use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    repo_id,
    dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model.eval()

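image = Image.open("sample.jpg").convert("RGB")
# Question (Vietnamese): "Describe the main content of this image."
question = "Hãy mô tả nội dung chính của ảnh này."
messages = [
    {
        "role": "system",
        # System prompt (Vietnamese): "You are a Vietnamese vision assistant.
        # Answer accurately and concisely based on the image."
        "content": "Bạn là trợ lý thị giác tiếng Việt. Trả lời chính xác, ngắn gọn dựa trên hình ảnh.",
    },
    # The <image> placeholder marks where the processor places the image tokens.
    {"role": "user", "content": f"<image>\n{question}"},
]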

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=prompt, images=image, return_tensors="pt", truncation=True, max_length=2048)
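# Move every tensor to the device that holds the model weights.
device = next(model.parameters()).device
vision_dtype = next(model.vision_tower.parameters()).dtype
inputs = {key: value.to(device) for key, value in inputs.items()}
# Cast the image tensor to the vision tower's dtype to avoid a dtype mismatch.
inputs["pixel_values"] = inputs["pixel_values"].to(dtype=vision_dtype)

# Stop on the tokenizer's EOS token and on the Llama 3 end-of-turn and end-of-text tokens when present.
eos_ids = {processor.tokenizer.eos_token_id}
for token in ("<|eot_id|>", "<|end_of_text|>"):
    token_id = processor.tokenizer.convert_tokens_to_ids(token)
    if isinstance(token_id, int) and token_id >= 0:
        eos_ids.add(token_id)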

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=sorted(i for i in eos_ids if i is not None),
        pad_token_id=processor.tokenizer.pad_token_id,
    )

input_len = inputs["input_ids"].shape[1]
answer = processor.tokenizer.decode(generated_ids[0, input_len:], skip_special_tokens=True).strip()
print(answer)
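
When asking several questions, the message construction above can be factored into a small helper. This is a sketch only: `build_messages` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names, and the prompt layout simply mirrors the example above (a system turn followed by a user turn that starts with `<image>`).

```python
DEFAULT_SYSTEM_PROMPT = (
    "Bạn là trợ lý thị giác tiếng Việt. "
    "Trả lời chính xác, ngắn gọn dựa trên hình ảnh."
)


def build_messages(question: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> list[dict]:
    """Return a system + user message pair for a single-image question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        # The user turn starts with the <image> placeholder, as in the example above.
        {"role": "user", "content": f"<image>\n{question}"},
    ]


# Example question (Vietnamese): "How many people are in the image?"
messages = build_messages("Trong ảnh có bao nhiêu người?")
```

The resulting `messages` list can be passed to `processor.tokenizer.apply_chat_template` in place of the hand-written list.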

For reproducibility, the repository also keeps the original project-native checkpoint files (llm/, processor/, tokenizer/, projector.pt, and lm_head.pt). Standard Transformers inference should use the root config.json, the root processor and tokenizer files, and the root model-*.safetensors shards.

Local Gradio Demo

This package includes a local Gradio app with the project logo in the header.

pip install -r requirements.txt
python hf_upload/llava-vietnamese-aio/app.py

By default, the app uses this package as the checkpoint path and loads the model lazily on the first inference request. You can override runtime settings with environment variables:

LLAVA_CHECKPOINT=/path/to/checkpoint \
LLAVA_DEVICE=cuda:0 \
GRADIO_SERVER_PORT=7860 \
python hf_upload/llava-vietnamese-aio/app.py

Open the printed local URL, upload an image, enter a Vietnamese question, and run inference.
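
The internals of app.py are not shown here, so the following is only a sketch of how such environment overrides are commonly read. `RuntimeSettings`, `load_settings`, and the fallback values for the checkpoint path and device are hypothetical; 7860 matches the port used in the example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RuntimeSettings:
    checkpoint: str
    device: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> RuntimeSettings:
    """Read runtime overrides from the environment, falling back to defaults."""
    return RuntimeSettings(
        # Hypothetical fallback: treat the current directory as the checkpoint.
        checkpoint=env.get("LLAVA_CHECKPOINT", "."),
        # Hypothetical fallback: run on CPU when no device is given.
        device=env.get("LLAVA_DEVICE", "cpu"),
        port=int(env.get("GRADIO_SERVER_PORT", "7860")),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"LLAVA_DEVICE": "cuda:0"})
print(settings)
```

Accepting the environment as a mapping argument makes the parsing easy to check without touching the real process environment.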

Training Details

The instruction-tuning run used the configuration snapshot in training_config.yaml.

Training stage: instruction_tuning
Mixed precision: bf16
Model dtype: bfloat16
Projector dtype: float32
Optimizer: Adafactor
Batch size: 1
Gradient accumulation: 16
Max text tokens: 2048
Seed: 42
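
To put the batch settings in perspective, the sketch below computes the effective batch size per optimizer step. It rests on assumptions the card does not state: that "Batch size" is per device, that training ran on a single device, and that the checkpoint step (6000) counts optimizer steps.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


# Batch size 1 and gradient accumulation 16 come from the list above.
# num_devices=1 is an assumption; the card does not state the device count.
per_step = effective_batch_size(per_device_batch=1, grad_accum_steps=16)

# Assumption: step 6000 counts optimizer steps, not micro-batches.
samples_at_checkpoint = per_step * 6000
print(per_step, samples_at_checkpoint)
```

Under these assumptions each update averages gradients over 16 samples, and the checkpoint would have seen about 96,000 samples. With more devices, or if steps count micro-batches, the figure changes accordingly.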

Training Data

The training mix is defined in training_config.yaml and includes:

5CD-AI/Viet-ShareGPT-4o-Text-VQA
5CD-AI/Viet-Localization-VQA
Vietnam tourism image QA data prepared by the project pipeline

The configured sample weights were [40, 55, 5] for the three training sources above.
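
One plausible reading of these weights is as relative sampling proportions. The sketch below normalizes them and draws sources accordingly; how the project pipeline actually applies the weights is not documented here, and the label `vietnam-tourism-qa` is a hypothetical stand-in for the third, unnamed source.

```python
import random
from collections import Counter

SOURCES = [
    "5CD-AI/Viet-ShareGPT-4o-Text-VQA",
    "5CD-AI/Viet-Localization-VQA",
    "vietnam-tourism-qa",  # hypothetical label for the project-prepared data
]
WEIGHTS = [40, 55, 5]


def mixing_probabilities(weights: list[int]) -> list[float]:
    """Normalize relative weights so they sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


probs = mixing_probabilities(WEIGHTS)

# Draw 10,000 source picks with a fixed seed to see the empirical mix.
rng = random.Random(42)
draws = Counter(rng.choices(SOURCES, weights=WEIGHTS, k=10_000))
for source, prob in zip(SOURCES, probs):
    print(f"{source}: target {prob:.2f}, drawn {draws[source] / 10_000:.3f}")
```

Under this reading, roughly 40% of samples come from the text VQA set, 55% from the localization set, and 5% from the tourism data.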

Evaluation

The best recorded internal validation loss for this checkpoint is:

eval_loss = 1.0164677494163994

No public benchmark score is reported yet. Treat the validation metric as an internal training signal, not as a broad claim of real-world performance.
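
If the reported eval_loss is a mean per-token cross-entropy in nats (an assumption; the card does not say how the loss is averaged), it can be converted to a perplexity for easier interpretation:

```python
import math

# Validation loss reported above.
eval_loss = 1.0164677494163994

# Assumption: the loss is mean token-level cross-entropy in nats,
# so perplexity is its exponential.
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```

This gives a perplexity of roughly 2.76 on the internal validation mix, which is still only an internal training signal.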

Limitations

The model may hallucinate details that are not visible in the image.
The model is optimized for Vietnamese prompts and may be weaker on other languages.
OCR-heavy tasks, fine-grained localization, counting, and small-object reasoning can be unreliable.
Performance depends on the prompt, image quality, and available inference hardware.

License and Usage

This checkpoint inherits usage constraints from its base models and training data. Review the license and acceptable-use terms for:

meta-llama/Llama-3.2-1B-Instruct
google/siglip2-so400m-patch16-384
the datasets listed in training_config.yaml

Redistribution and production deployment should happen only after confirming that the combined model, data, and application use case satisfy the upstream terms.

Citation and Acknowledgements

This work builds on the LLaVA-style multimodal architecture, Hugging Face Transformers, SigLIP2, and Llama 3.2. Please cite the relevant upstream projects and datasets when using this checkpoint in published work.

Safetensors model size: 2B params
Tensor type: F16
