Github | Inference Notebook | Dataset | Model Family
Model Details
We have developed and released the Vista 7B model family, which includes a pretrained projector and a finetuned Vietnamese Vision Language Model (VLM). The models are optimized for image description tasks.
We extend Vistral 7B with vision capabilities using the LLaVA approach, training on our Vista dataset with SigLIP as the image encoder.
Disclaimer: The model has not been trained on OCR tasks and may perform poorly in OCR and graph analysis. Use with caution, as we have not focused on correcting the factual knowledge of the model.
Model developers: Vi-VLM
Input: The models take text and images as input.
Output: The models generate image descriptions only.
Model Architecture: Mistral.
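To make the LLaVA-style data flow concrete, here is a minimal sketch in plain Python. It assumes a two-layer MLP projector with GELU (the design used in LLaVA-1.5); the projector type actually used for Vista is not stated in this card. All sizes, weights, and function names below are toy, hypothetical values chosen for illustration, not the real SigLIP or Vistral dimensions.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    """Random weight of shape (in_dim, out_dim) and a zero bias (toy initialisation)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    return weight, [0.0] * out_dim


def linear(rows, weight, bias):
    """Apply y = x @ W + b to every row of `rows`."""
    columns = list(zip(*weight))  # out_dim columns, each of length in_dim
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in rows
    ]


def project(patch_features, params):
    """Map vision-encoder patch features into the LLM embedding space."""
    (w1, b1), (w2, b2) = params
    hidden = [[gelu(v) for v in row] for row in linear(patch_features, w1, b1)]
    return linear(hidden, w2, b2)


def build_llm_input(text_embeddings, image_tokens, image_position):
    """Splice the projected image tokens into the text sequence."""
    return text_embeddings[:image_position] + image_tokens + text_embeddings[image_position:]


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT = 6, 8, 4, 5  # toy sizes (assumption)

params = (init_linear(VISION_DIM, LLM_DIM, rng), init_linear(LLM_DIM, LLM_DIM, rng))
# Stand-ins for the image encoder output and the LLM's text token embeddings.
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT)]

image_tokens = project(patch_features, params)
sequence = build_llm_input(text_embeddings, image_tokens, image_position=1)
print(len(sequence), len(sequence[0]))  # prints "9 8"
```

In this kind of setup, the pretraining stage typically updates only the projector, while finetuning also adapts the language model; the training arguments in this card list separate learning rates for the two components.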
Intended Use
Intended Use Cases: Vista is primarily intended for research applications within the Vietnamese context. This version aims to further improve Vietnamese Vision Language Model capabilities.
Out-of-scope: Use of Vista in any manner that violates applicable laws or regulations is strictly prohibited.
How to use
Use with Kaggle Notebook
To run inference with the model, follow the steps in our Kaggle inference notebook.
Training process
Training Metrics Image: Below is a snapshot of the training metrics visualized.
Weights & Biases: Monitor the training progress and access additional analytics at our WandB project page.
Training Data
Pretrained Model:
- Dataset: ShareGPT4V and a subset of WIT from the Vista dataset.
Finetuned Model:
- Tasks:
  - Conversation
  - Complex reasoning
  - Detailed description
- Dataset: Subset of the Vista dataset.
Hardware
GPU Configuration: Cluster of 2x NVIDIA A100-SXM4-40GB, provided by Google Cloud Research and VietAI.
GPU Usage:
- Pretrain: 4 hours of GPU time.
- Finetune: 14 hours of GPU time.
Training Arguments
Parameter | Pretrain | Finetune (LoRA)
---|---|---
Epochs | 1 | 1
Global batch size | 16 | 16
Learning rate scheduler | cosine with warmup | cosine with warmup
Optimizer | AdamW | AdamW
Warmup ratio | 0.03 | 0.03
Weight decay | 0.00 | 0.00
Learning rate (LLM) | - | 1.25e-5
Learning rate (Projector) | 1e-3 | 1.25e-6
LoRA rank | - | 128
LoRA alpha | - | 256
Target modules | - | all linear layers
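As a rough illustration of what these arguments imply, the sketch below computes a cosine schedule with linear warmup and the standard LoRA scaling factor alpha / rank. The schedule form (linear warmup, then cosine decay to zero), the rounding of the warmup step count, and the total of 1000 steps are assumptions for illustration; the card does not state the exact scheduler implementation or the number of training steps.

```python
import math


def cosine_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def lora_scale(rank, alpha):
    """Standard LoRA multiplies the low-rank update by alpha / rank."""
    return alpha / rank


TOTAL_STEPS = 1000  # hypothetical; the real step count is not given in the card

# Peak learning rates taken from the table above.
peak_lrs = {
    "pretrain/projector": 1e-3,
    "finetune/llm": 1.25e-5,
    "finetune/projector": 1.25e-6,
}

for name, peak in peak_lrs.items():
    start = cosine_with_warmup(0, TOTAL_STEPS, peak)
    middle = cosine_with_warmup(TOTAL_STEPS // 2, TOTAL_STEPS, peak)
    end = cosine_with_warmup(TOTAL_STEPS, TOTAL_STEPS, peak)
    print(f"{name}: start={start:.3e} middle={middle:.3e} end={end:.3e}")

print("LoRA scale:", lora_scale(rank=128, alpha=256))
```

With rank 128 and alpha 256, the LoRA update is scaled by a factor of 2 under the standard formulation; variants such as rank-stabilized LoRA use a different scaling rule.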
Examples
Responsibility & Safety
We are committed to promoting an open approach to the development of Vietnamese AI, believing that it fosters better and faster innovation. This initiative is designed to bolster the efforts of the Vietnamese AI community.
The Vista model is built for use across a broad range of applications. However, it is not tailored out of the box to every developer preference or use case, since such preferences vary significantly between applications.
Ethical Considerations and Limitations
The responses from this model are not intended to offend or insult any individual or organization. The answers provided should be treated as reference material only, and users should critically assess their accuracy.
The model still has significant limitations in both knowledge and practical task performance.
Future Work
We are committed to continuous improvement of the model, with specific plans to:
- Further train the finetuned model on diverse Vision Language tasks to enhance its performance.
- Improve the factual knowledge of the model, particularly to better adapt to Vietnamese cultural contexts.
- Investigate the combination of different vision encoders to capture more comprehensive image features.
Acknowledgement
We express our deep gratitude to various contributors and supporters of our project:
[LLaVA]: Significant portions of the source code and instructions were adapted from the LLaVA repository, with modifications to fit our model architecture.
[Vistral]: Immense thanks to the Vistral development team for creating an outstanding LLM for Vietnamese, available on Hugging Face as Vistral-7B-Chat.
[SigLIP]: We are grateful for the innovative multilingual vision encoder developed by the SigLIP team, detailed in their research paper.
Sponsors: Special thanks to [VietAI] and [Google Cloud Research] for their diamond-level sponsorship, providing the computing resources essential for our project.
Mentors: Our heartfelt appreciation goes to our mentors, Anh Duong Nguyen and Thanh Le, for their guidance and support.
Citation Information
BibTeX:
@article{ViVLM Vistral Vision 2024,
  title={Vistral V},
  author={Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van and Tran, Oanh Ngoc},
  year={2024},
  month={June},
  url={https://huggingface.co/Vi-VLM/Vistral-V-7B}
}