MoE-LLaVA-Qwen1.5-1.8B×4-Top2: When Vision Meets a Small-Scale Language Model and a Vietnamese Synthetic Dataset

Introducing MoE-LLaVA-Qwen1.5-1.8B×4-Top2 for Vietnamese

We are excited to present MoE-LLaVA-Qwen1.5-1.8B×4-Top2, tailored for the Vietnamese language. This model is part of our ongoing effort to develop Vision Language Models (VLMs) for Vietnamese, a field where few models exist and most of them are large (~7B parameters). Our model activates approximately 2.2B parameters per call 🤗😎, which significantly reduces the memory footprint, and it can be quantized for local execution.
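As a rough illustration of what quantization means for local execution, the sketch below estimates weight storage for the roughly 2.2B activated parameters at a few bit widths. It is back-of-the-envelope arithmetic only: the helper name `weight_memory_gb` is hypothetical, the bit widths are illustrative assumptions, and the total parameter count across all four experts (which this card does not state) would be larger than the activated count.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Approximate activated-parameter figure quoted in this card.
ACTIVATED_PARAMS = 2.2e9

# Illustrative bit widths (assumption): 16-bit floats, 8-bit and 4-bit quantization.
estimates = {bits: weight_memory_gb(ACTIVATED_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

These figures ignore activations, the vision encoder's runtime buffers, and any key-value cache, so actual usage will be higher.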

Training Dataset

Our model is trained on the Vi-VLM/Vista dataset, which includes around 700,000 Vietnamese vision-language samples curated by Gemini Pro. We employed several prompt engineering techniques, including:

Few-shot Learning
Caption-based Prompting
Image-based Prompting

For the COCO dataset, we used LLaVA-style prompts to generate data. For the ShareGPT4V dataset, we applied translation prompts.
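To make the combination of few-shot and caption-based prompting more concrete, here is a minimal sketch of how such a prompt could be assembled from image captions. The function `build_caption_prompt`, the instruction wording, and the example pair are all hypothetical; the actual prompts used to build the dataset are not reproduced in this card.

```python
from typing import List, Tuple


def build_caption_prompt(captions: List[str], few_shot: List[Tuple[str, str]]) -> str:
    """Assemble a text-only prompt: instruction, worked examples, then the new captions."""
    # Hypothetical instruction text, not the wording used for the real dataset.
    parts = [
        "You are given captions describing an image. "
        "Write a question and an answer about the image in Vietnamese."
    ]
    # Few-shot part: each example pairs a caption block with a desired output.
    for example_captions, example_output in few_shot:
        parts.append(f"Captions:\n{example_captions}\nOutput:\n{example_output}")
    # Caption-based part: the model sees only text captions, not the image itself.
    joined = "\n".join(f"- {c}" for c in captions)
    parts.append(f"Captions:\n{joined}\nOutput:")
    return "\n\n".join(parts)


few_shot_examples = [
    (
        "- A dog runs on the beach.",
        "Q: Con chó đang làm gì?\nA: Con chó đang chạy trên bãi biển.",
    )
]
prompt = build_caption_prompt(
    ["Two children fly a kite in a park.", "A red kite is high in the sky."],
    few_shot_examples,
)
print(prompt)
```

Image-based prompting would instead pass the image itself to a multimodal model; that variant is not shown here.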

Techniques Used

MoE-LLaVA
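The "×4-Top2" in the model name indicates four experts with the top two selected per token. The sketch below shows one common form of top-k expert routing in plain Python, using scalar toy experts. It is an illustration of the general idea under stated assumptions, not the MoE-LLaVA implementation: the names `top_k_route` and `moe_forward` are hypothetical, and renormalizing the softmax over only the selected experts is one variant among several.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1 (assumed variant).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    router_logits: Sequence[float],
) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits))


# Four toy experts standing in for the four expert feed-forward networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

routes = top_k_route(logits)
output = moe_forward(3.0, experts, logits)
print(routes, output)
```

Because only two of the four experts run for each token, the number of parameters used per call is smaller than the total number stored in the model.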

Evaluation

Coming soon 🫡

Bias, Risks, and Limitations

The dataset may contain biases originating from its sources. Users should remain aware of these potential biases when utilizing the dataset.

More Information

This dataset represents the first stage of a two-stage development process for a larger model. Stay tuned for future developments by subscribing to our updates.

tuanio/ft-moellava-qwen1.5-1.8b-vista-lora-2ep