MoE-LLaVA-Qwen1.5-1.8B×4-Top2: When Vision Meets a Small-Scale Language Model and a Vietnamese Synthetic Dataset
Introducing MoE-LLaVA-Qwen1.5-1.8B×4-Top2 for Vietnamese
We are excited to present MoE-LLaVA-Qwen1.5-1.8B×4-Top2, tailored for the Vietnamese language. This model is part of our ongoing effort to develop Vision Language Models (VLMs) for Vietnamese, a field where few models exist and most of them are large (~7B parameters). Our model activates approximately 2.2B 🤗😎 parameters per call, significantly reducing the memory footprint, and it can be quantized for local execution.
Training Dataset
Our model is trained on the comprehensive Vi-VLM/Vista dataset, which includes around 700,000 Vietnamese vision-language samples curated by Gemini Pro. We employed various prompt engineering techniques, including:
- Few-shot Learning
- Caption-based Prompting
- Image-based Prompting
For the COCO dataset, we used LLaVA-style prompts to generate data. For the ShareGPT4V dataset, we applied translation prompts.
Techniques Used
- MoE-LLaVA: MoE-LLaVA
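The "×4-Top2" in the model name refers to four experts with two active per token. The sketch below illustrates that general top-2 routing idea in plain Python; it is not the MoE-LLaVA implementation. The names `top_k_route`, `moe_layer`, and `make_toy_expert` are hypothetical, the experts are toy scaling functions rather than feed-forward networks, and renormalizing the two selected weights so they sum to 1 is an assumption that may differ from the actual router.

```python
import math
import random

NUM_EXPERTS = 4  # the "×4" in the model name
TOP_K = 2        # the "Top2" in the model name


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: the selected softmax probabilities are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_weights, experts, k=TOP_K):
    """Apply a sparse mixture-of-experts layer to one token vector."""
    # Router: one linear score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    for expert_index, weight in top_k_route(logits, k):
        # Only the k selected experts are evaluated; the rest are skipped.
        expert_out = experts[expert_index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_toy_expert(scale):
    """Hypothetical stand-in for an expert feed-forward network."""
    return lambda vec: [scale * v for v in vec]


if __name__ == "__main__":
    random.seed(0)
    dim = 8
    experts = [make_toy_expert(s) for s in (0.5, 1.0, 1.5, 2.0)]
    router_weights = [[random.gauss(0, 1) for _ in range(dim)]
                      for _ in range(NUM_EXPERTS)]
    token = [random.gauss(0, 1) for _ in range(dim)]
    print(moe_layer(token, router_weights, experts))
```

Because only two of the four experts run for each token, the number of parameters touched per call is smaller than the total parameter count, which is the idea behind the "activated parameters" figure quoted above.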
Evaluation
- Coming soon 🫡
Bias, Risks, and Limitations
The dataset may contain biases originating from its sources. Users should remain aware of these potential biases when utilizing the dataset.
More Information
This dataset represents the first stage of a two-stage development process for a larger model. Stay tuned for future developments by subscribing to our updates.