LLaVA Vietnamese AIO

LLaVA Vietnamese AIO is a Vietnamese vision-language checkpoint for image understanding, visual question answering, and instruction-style multimodal responses. It combines a SigLIP2 vision encoder, a Llama 3.2 1B instruction model, and a trained multimodal projector in a standard Hugging Face Transformers LLaVA layout.

This repository stores only the deployable inference checkpoint. Optimizer, scheduler, and RNG states are intentionally excluded because they are needed only to resume training.

Model Details

  • Model type: LLaVA-style image-text-to-text model
  • Vision encoder: google/siglip2-so400m-patch16-384
  • Language model: meta-llama/Llama-3.2-1B-Instruct
  • Multimodal bridge: MLP projector
  • Primary language: Vietnamese
  • Checkpoint stage: instruction tuning
  • Checkpoint step: 6000
  • Best validation metric: eval_loss = 1.0164677494163994
  • Checkpoint format: standard transformers.LlavaForConditionalGeneration checkpoint
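
As a rough sizing aid, the vision encoder name above implies a 384-pixel input split into 16-pixel patches. The sketch below works out how many image tokens that would yield, assuming one projected token per patch with no class token and no pooling. That assumption is not stated in this card, and `image_token_count` is a hypothetical helper, not part of the project.

```python
def image_token_count(image_size: int = 384, patch_size: int = 16) -> int:
    """Patch tokens for a square image, assuming one token per patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


tokens = image_token_count()  # 24 patches per side
print(tokens)                 # 576
```

Under that assumption, each image would occupy 576 positions in the language model's input sequence before any text is added.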

Intended Use

This checkpoint is intended for Vietnamese multimodal experimentation and internal application prototyping, especially:

  • visual question answering in Vietnamese
  • short image-grounded instruction following
  • image description and scene understanding
  • local testing of the LLaVA project inference stack

It is not intended for safety-critical medical, legal, financial, identity, or surveillance decisions.

Usage

Install the runtime dependencies:

pip install -U "transformers>=4.55.4,<5" "accelerate>=1.12.0,<2" pillow torch

Minimal Python inference:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

repo_id = "VLAI-AIVN/llava-vietnamese-aio"

processor = AutoProcessor.from_pretrained(repo_id, use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    repo_id,
    dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
question = "Hãy mô tả nội dung chính của ảnh này."
messages = [
    {
        "role": "system",
        "content": "Bạn là trợ lý thị giác tiếng Việt. Trả lời chính xác, ngắn gọn dựa trên hình ảnh.",
    },
    {"role": "user", "content": f"<image>\n{question}"},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=prompt, images=image, return_tensors="pt", truncation=True, max_length=2048)
# Move every tensor to the model's device, then match the vision tower's dtype.
device = next(model.parameters()).device
vision_dtype = next(model.vision_tower.parameters()).dtype
inputs = {key: value.to(device) for key, value in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(dtype=vision_dtype)

# Stop on the tokenizer's EOS token plus the Llama 3 end-of-turn markers, when present.
eos_ids = {processor.tokenizer.eos_token_id}
for token in ("<|eot_id|>", "<|end_of_text|>"):
    token_id = processor.tokenizer.convert_tokens_to_ids(token)
    if isinstance(token_id, int) and token_id >= 0:
        eos_ids.add(token_id)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=sorted(i for i in eos_ids if i is not None),
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs["input_ids"].shape[1]
answer = processor.tokenizer.decode(generated_ids[0, input_len:], skip_special_tokens=True).strip()
print(answer)

The repository also keeps the original project-native checkpoint files (llm/, processor/, tokenizer/, projector.pt, and lm_head.pt) for reproducibility. Standard Transformers inference should instead use the root config.json, the root processor and tokenizer files, and the root model-*.safetensors shards.
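
Before loading a local copy, it can help to confirm that the root-level files are present. The following is a minimal sketch, assuming the checkpoint has been downloaded to a local directory; `find_root_checkpoint_files` is a hypothetical helper and it checks only the two file patterns named above.

```python
from pathlib import Path


def find_root_checkpoint_files(checkpoint_dir):
    """Return (config_path, shard_paths) for the root-level Transformers files."""
    root = Path(checkpoint_dir)
    config = root / "config.json"
    # Only root-level shards count; project-native files live in subfolders.
    shards = sorted(root.glob("model-*.safetensors"))
    if not config.is_file():
        raise FileNotFoundError(f"missing {config}")
    if not shards:
        raise FileNotFoundError(f"no model-*.safetensors shards in {root}")
    return config, shards
```

Calling `find_root_checkpoint_files("/path/to/checkpoint")` either returns the paths or raises `FileNotFoundError` naming what is missing.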

Local Gradio Demo

This package includes a local Gradio app with the project logo in the header.

pip install -r requirements.txt
python hf_upload/llava-vietnamese-aio/app.py

By default, the app uses this package as the checkpoint path and loads the model lazily on the first inference request. You can override runtime settings with environment variables:

LLAVA_CHECKPOINT=/path/to/checkpoint \
LLAVA_DEVICE=cuda:0 \
GRADIO_SERVER_PORT=7860 \
python hf_upload/llava-vietnamese-aio/app.py

Open the printed local URL, upload an image, enter a Vietnamese question, and run inference.
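
For readers wiring the same variables into their own script, here is a minimal sketch of reading them with fallbacks. It is not the code in app.py: `RuntimeSettings` and `load_settings` are hypothetical names, the empty-string device default is an assumption, and 7860 is used as the port fallback because it is Gradio's usual default.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSettings:
    checkpoint: str
    device: str  # empty string means "let the application choose" (assumption)
    port: int


def load_settings(default_checkpoint: str) -> RuntimeSettings:
    """Read the documented environment variables, falling back to defaults."""
    return RuntimeSettings(
        checkpoint=os.environ.get("LLAVA_CHECKPOINT", default_checkpoint),
        device=os.environ.get("LLAVA_DEVICE", ""),
        port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
    )


settings = load_settings(default_checkpoint=".")
print(settings)
```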

Training Details

The instruction-tuning run used the configuration snapshot in training_config.yaml.

  • Training stage: instruction_tuning
  • Mixed precision: bf16
  • Model dtype: bfloat16
  • Projector dtype: float32
  • Optimizer: Adafactor
  • Batch size: 1
  • Gradient accumulation: 16
  • Max text tokens: 2048
  • Seed: 42
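
The batch size and gradient accumulation values combine into an effective batch size per optimizer step. The sketch below shows the arithmetic, assuming a single device and assuming the checkpoint step counts optimizer updates; neither is stated in this card, and the helper names are hypothetical.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def samples_seen(optimizer_steps: int, batch: int) -> int:
    """Total samples processed after a given number of optimizer updates."""
    return optimizer_steps * batch


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=16)
print(batch)                      # 16
print(samples_seen(6000, batch))  # 96000
```

With more than one device, the effective batch size and the sample count scale by the device count.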

Training Data

The training mix is defined in training_config.yaml and includes:

  • 5CD-AI/Viet-ShareGPT-4o-Text-VQA
  • 5CD-AI/Viet-Localization-VQA
  • Vietnam tourism image QA data prepared by the project pipeline

The configured sample weights were [40, 55, 5] for the three training sources above.
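
One common way to apply such weights is to normalize them into per-source sampling probabilities. The sketch below illustrates that reading with the standard library. It is an assumption about how the weights are used, not the project's data loader, and the third source label is a hypothetical placeholder.

```python
import random

SOURCES = [
    "5CD-AI/Viet-ShareGPT-4o-Text-VQA",
    "5CD-AI/Viet-Localization-VQA",
    "vietnam-tourism-image-qa",  # hypothetical label for the project-prepared data
]
WEIGHTS = [40, 55, 5]


def sampling_probabilities(weights):
    """Normalize raw weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def pick_source(rng: random.Random) -> str:
    """Draw one source name in proportion to its weight."""
    return rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]


rng = random.Random(42)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[pick_source(rng)] += 1

print(sampling_probabilities(WEIGHTS))  # [0.4, 0.55, 0.05]
print(counts)
```

Under this reading, roughly 40%, 55%, and 5% of drawn samples come from the three sources respectively.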

Evaluation

The best recorded internal validation loss for this checkpoint is:

eval_loss = 1.0164677494163994

No public benchmark score is reported yet. Treat the validation metric as an internal training signal, not as a broad claim of real-world performance.
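
If the reported value is a mean per-token cross-entropy in nats (an assumption; the card does not say how the loss is averaged), it can be converted to perplexity for easier reading:

```python
import math

eval_loss = 1.0164677494163994   # value reported above
perplexity = math.exp(eval_loss)  # valid only for a mean per-token loss in nats
print(f"{perplexity:.2f}")        # 2.76
```

The same caveat applies to this number: it describes the internal validation split only.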

Limitations

  • The model may hallucinate details that are not visible in the image.
  • The model is optimized for Vietnamese prompts and may be weaker on other languages.
  • OCR-heavy, fine-grained localization, counting, and small-object reasoning can be unreliable.
  • Performance depends on the prompt, image quality, and available inference hardware.

License and Usage

This checkpoint inherits usage constraints from its base models and training data. Review the license and acceptable-use terms for:

  • meta-llama/Llama-3.2-1B-Instruct
  • google/siglip2-so400m-patch16-384
  • the datasets listed in training_config.yaml

Redistribution and production deployment should happen only after confirming that the combined model, data, and application use case satisfy the upstream terms.

Citation and Acknowledgements

This work builds on the LLaVA-style multimodal architecture, Hugging Face Transformers, SigLIP2, and Llama 3.2. Please cite the relevant upstream projects and datasets when using this checkpoint in published work.
