LLaVA Vietnamese AIO
LLaVA Vietnamese AIO is a Vietnamese vision-language checkpoint for image understanding, visual question answering, and instruction-style multimodal responses. It combines a SigLIP2 vision encoder, a Llama 3.2 1B instruction model, and a trained multimodal projector in a standard Hugging Face Transformers LLaVA layout.
This repository stores the deployable inference checkpoint only. Optimizer, scheduler, and RNG states were intentionally excluded because they are needed only for training resume.
Model Details
- Model type: LLaVA-style image-text-to-text model
- Vision encoder: google/siglip2-so400m-patch16-384
- Language model: meta-llama/Llama-3.2-1B-Instruct
- Multimodal bridge: MLP projector
- Primary language: Vietnamese
- Checkpoint stage: instruction tuning
- Checkpoint step: 6000
- Best validation metric: eval_loss = 1.0164677494163994
- Checkpoint format: standard transformers.LlavaForConditionalGeneration checkpoint
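As a rough guide to how much of the context window an image occupies, the encoder name suggests a 384-pixel input split into 16-pixel patches. The sketch below counts those patches. It assumes the projector forwards every patch embedding without pooling or extra tokens, which this card does not state, so treat the result as an estimate rather than a documented figure.

```python
def patch_token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Values read off the encoder name: 384-pixel input, 16-pixel patches.
# Assumption: one image token per patch, no pooling (not confirmed by the card).
tokens = patch_token_count(384, 16)
print(tokens)  # 576
```

Under that assumption, a single image would take up 576 positions of the sequence before any text is added.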
Intended Use
This checkpoint is intended for Vietnamese multimodal experimentation and internal application prototyping, especially:
- visual question answering in Vietnamese
- short image-grounded instruction following
- image description and scene understanding
- local testing of the LLAVA project inference stack
It is not intended for safety-critical medical, legal, financial, identity, or surveillance decisions.
Usage
Install the runtime dependencies:
pip install -U "transformers>=4.55.4,<5" "accelerate>=1.12.0,<2" pillow torch
Minimal Python inference:
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

repo_id = "VLAI-AIVN/llava-vietnamese-aio"

# Load the processor and the model from the root checkpoint files.
processor = AutoProcessor.from_pretrained(repo_id, use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    repo_id,
    dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
question = "Hãy mô tả nội dung chính của ảnh này."

# The <image> placeholder marks where the image features are inserted.
messages = [
    {
        "role": "system",
        "content": "Bạn là trợ lý thị giác tiếng Việt. Trả lời chính xác, ngắn gọn dựa trên hình ảnh.",
    },
    {"role": "user", "content": f"<image>\n{question}"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(text=prompt, images=image, return_tensors="pt", truncation=True, max_length=2048)

# Move inputs to the model device and match the vision tower dtype.
device = next(model.parameters()).device
vision_dtype = next(model.vision_tower.parameters()).dtype
inputs = {key: value.to(device) for key, value in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(dtype=vision_dtype)

# Stop on any of the Llama 3 end-of-turn or end-of-text tokens.
eos_ids = {processor.tokenizer.eos_token_id}
for token in ("<|eot_id|>", "<|end_of_text|>"):
    token_id = processor.tokenizer.convert_tokens_to_ids(token)
    if isinstance(token_id, int) and token_id >= 0:
        eos_ids.add(token_id)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=sorted(i for i in eos_ids if i is not None),
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt.
input_len = inputs["input_ids"].shape[1]
answer = processor.tokenizer.decode(generated_ids[0, input_len:], skip_special_tokens=True).strip()
print(answer)
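When asking several questions about the same or different images, the message construction can be factored into a small helper. The sketch below is only a convenience wrapper around the message layout shown above; the function name build_messages and the constant DEFAULT_SYSTEM_PROMPT are hypothetical and are not part of this repository.

```python
# Hypothetical helper, not shipped with the checkpoint.
DEFAULT_SYSTEM_PROMPT = (
    "Bạn là trợ lý thị giác tiếng Việt. "
    "Trả lời chính xác, ngắn gọn dựa trên hình ảnh."
)


def build_messages(question: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> list:
    """Return a system + user message pair with a single <image> placeholder."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"<image>\n{question}"},
    ]


messages = build_messages("Trong ảnh có những vật thể nào?")
```

The returned list can be passed to processor.tokenizer.apply_chat_template in place of the hand-written messages list.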
The repository still keeps the original project-native checkpoint files under llm/, processor/, tokenizer/, projector.pt, and lm_head.pt for reproducibility, but standard Transformers inference should use the root config.json, root processor/tokenizer files, and root model-*.safetensors shards.
Local Gradio Demo
This package includes a local Gradio app with the project logo in the header.
pip install -r requirements.txt
python hf_upload/llava-vietnamese-aio/app.py
By default, the app uses this package as the checkpoint path and loads the model lazily on the first inference request. You can override runtime settings with environment variables:
LLAVA_CHECKPOINT=/path/to/checkpoint \
LLAVA_DEVICE=cuda:0 \
GRADIO_SERVER_PORT=7860 \
python hf_upload/llava-vietnamese-aio/app.py
Open the printed local URL, upload an image, enter a Vietnamese question, and run inference.
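For readers adapting the demo, the following sketch shows one way an app could read these three variables with fallbacks. It is not the code in app.py: the DemoSettings class, the load_settings function, and the fallback values for the checkpoint path and device are assumptions made for illustration. The port fallback of 7860 is Gradio's usual default.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoSettings:
    checkpoint: str
    device: str
    port: int


def load_settings(env=None) -> DemoSettings:
    """Read demo settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    return DemoSettings(
        checkpoint=env.get("LLAVA_CHECKPOINT", "."),  # assumed fallback
        device=env.get("LLAVA_DEVICE", "cpu"),        # assumed fallback
        port=int(env.get("GRADIO_SERVER_PORT", "7860")),
    )


settings = load_settings({"LLAVA_DEVICE": "cuda:0"})
print(settings)
```

Passing a plain dict instead of os.environ keeps the function easy to test without touching the real environment.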
Training Details
The instruction-tuning run used the configuration snapshot in training_config.yaml.
- Training stage: instruction_tuning
- Mixed precision: bf16
- Model dtype: bfloat16
- Projector dtype: float32
- Optimizer: Adafactor
- Batch size: 1
- Gradient accumulation: 16
- Max text tokens: 2048
- Seed: 42
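The batch size and gradient accumulation settings together determine how many samples contribute to each optimizer update. The arithmetic below is a sketch under two assumptions that this card does not confirm: training ran on a single device, and the reported checkpoint step of 6000 counts optimizer updates rather than micro-batches.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


# Batch size 1 and gradient accumulation 16 come from the list above.
# num_devices=1 is an assumption.
batch = effective_batch_size(1, 16)

# Only meaningful if the checkpoint step counts optimizer updates (assumption).
samples_seen = batch * 6000

print(batch, samples_seen)  # 16 96000
```

If the run used more devices, or if steps count micro-batches, the totals scale accordingly.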
Training Data
The training mix is defined in training_config.yaml and includes:
- 5CD-AI/Viet-ShareGPT-4o-Text-VQA
- 5CD-AI/Viet-Localization-VQA
- Vietnam tourism image QA data prepared by the project pipeline
The configured sample weights were [40, 55, 5] for the three training sources above.
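One common way to apply such weights is to draw the source of each training example with probability proportional to its weight. The sketch below illustrates that interpretation only; the project's actual sampler may differ, and the label "vietnam-tourism-qa" is a placeholder for the project-prepared tourism data, not a real dataset ID.

```python
import random
from collections import Counter

SOURCES = [
    "5CD-AI/Viet-ShareGPT-4o-Text-VQA",
    "5CD-AI/Viet-Localization-VQA",
    "vietnam-tourism-qa",  # placeholder label, not a real dataset ID
]
WEIGHTS = [40, 55, 5]


def sampling_probabilities(weights):
    """Normalize raw weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sample_sources(n: int, seed: int = 42):
    """Draw n source names with probability proportional to WEIGHTS."""
    rng = random.Random(seed)
    return rng.choices(SOURCES, weights=WEIGHTS, k=n)


probs = sampling_probabilities(WEIGHTS)
counts = Counter(sample_sources(10_000))
print(probs)
print(counts.most_common())
```

Under this reading, roughly 40%, 55%, and 5% of the drawn examples come from the three sources in the order listed.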
Evaluation
The best recorded internal validation loss for this checkpoint is:
eval_loss = 1.0164677494163994
No public benchmark score is reported yet. Treat the validation metric as an internal training signal, not as a broad claim of real-world performance.
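For intuition, a cross-entropy loss can be converted to perplexity by exponentiation. The snippet below assumes eval_loss is the mean per-token negative log-likelihood in nats, which is the usual convention for Transformers training loops but is not stated explicitly in this card.

```python
import math

EVAL_LOSS = 1.0164677494163994  # value reported above


def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


ppl = perplexity(EVAL_LOSS)
print(f"{ppl:.2f}")  # roughly 2.76
```

This figure applies only to the internal validation mix and carries the same caveats as the loss itself.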
Limitations
- The model may hallucinate details that are not visible in the image.
- The model is optimized for Vietnamese prompts and may be weaker on other languages.
- OCR-heavy tasks, fine-grained localization, counting, and small-object reasoning can be unreliable.
- Performance depends on the prompt, image quality, and available inference hardware.
License and Usage
This checkpoint inherits usage constraints from its base models and training data. Review the license and acceptable-use terms for:
- meta-llama/Llama-3.2-1B-Instruct
- google/siglip2-so400m-patch16-384
- the datasets listed in training_config.yaml
Redistribution and production deployment should happen only after confirming that the combined model, data, and application use case satisfy the upstream terms.
Citation and Acknowledgements
This work builds on the LLaVA-style multimodal architecture, Hugging Face Transformers, SigLIP2, and Llama 3.2. Please cite the relevant upstream projects and datasets when using this checkpoint in published work.