---
license: apache-2.0
datasets:
- Vi-VLM/Vista
language:
- vi
- en
tags:
- vision language model
---

<p>
  <a href="https://github.com/hllj/Vistral-V">Github</a> |
  <a href="https://kaggle.com/kernels/welcome?src=https://github.com/hllj/Vistral-V/blob/master/assets/quickstart_example.ipynb">Inference Notebook</a> |
  <a href="https://huggingface.co/datasets/Vi-VLM/Vista">Dataset</a> |
  <a href="https://huggingface.co/collections/Vi-VLM/vista-668126169f4f7654f07cae66">Model Family</a>
</p>

## Model Details

We have developed and released the Vista 7B model family, which includes both a pretrained projector and a finetuned version of the Vietnamese Vision Language Model (VLM). The model is optimized for image description tasks.

We extend Vistral 7B with vision capabilities using the [LLaVA approach](https://github.com/haotian-liu/LLaVA), training on our own [Vista dataset](https://huggingface.co/datasets/Vi-VLM/Vista) with [SigLIP](https://arxiv.org/abs/2303.15343) as the image encoder.

> **Disclaimer**: The model has not been trained on OCR tasks and may perform poorly in OCR and graph analysis. Use with caution, as we have not focused on correcting the factual knowledge of the model.

**Model developers** Vi-VLM

**Input** The model takes text and an image as input.

**Output** The model generates image descriptions only.

**Model Architecture** Mistral.

## Intended Use

**Intended Use Cases** Vista is primarily intended for research applications within the Vietnamese context. This version aims to further improve the Vietnamese Vision Language Model capabilities.

**Out-of-scope** The use of Vista in any manner that violates applicable laws or regulations is strictly prohibited.

## How to use

### Use with Kaggle Notebook
To run inference with the model, follow the steps in our Kaggle inference notebook:
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/hllj/Vistral-V/blob/master/assets/quickstart_example.ipynb)

## Training process

**Training Metrics Image**: The snapshot below shows the training metrics.
  
![Training Metrics](https://cdn-uploads.huggingface.co/production/uploads/630a5ef0e81e1dea2cedcec0/rjf1SL3-o7IUBJerUmCDT.png)

**Weights & Biases**: Monitor the training progress and access additional analytics at our [WandB project page](https://wandb.ai/hllj/huggingface).

### Training Data

**Pretrained Model**: 
  - Dataset: ShareGPT4V and a subset of WIT from the [Vista dataset](https://huggingface.co/datasets/Vi-VLM/Vista).

**Finetuned Model**:
  - Tasks: 
    - Conversation
    - Complex reasoning
    - Detailed description
  - Dataset: Subset from the [Vista dataset](https://huggingface.co/datasets/Vi-VLM/Vista).

### Hardware

**GPU Configuration**: Cluster of 2x NVIDIA A100-SXM4-40GB, provided by Google Cloud Research and [VietAI](https://course.vietai.org/).

**GPU Usage**:
  - **Pretrain**: 4 hours of GPU time.
  - **Finetune**: 14 hours of GPU time.

### Training Arguments

| Parameter                  | Pretrain                | Finetune (LoRA)               |
|----------------------------|-------------------------|-------------------------------|
| **Epoch**                  | 1                       | 1                             |
| **Global batch size**      | 16                      | 16                            |
| **Learning rate scheduler** | cosine with warmup      | cosine with warmup            |
| **Optimizer**              | AdamW                   | AdamW                         |
| **Warmup Ratio**           | 0.03                    | 0.03                          |
| **Weight Decay**           | 0.00                    | 0.00                          |
| **Learning rate (LLM)**    | -                       | 1.25e-5                       |
| **Learning rate (Projector)** | 1e-3                 | 1.25e-6                       |
| **LoRA rank**              | -                       | 128                           |
| **LoRA alpha**             | -                       | 256                           |
| **Target modules**         | -                       | all linear layers             |
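
The sketch below illustrates how a few of the table entries interact. It is not the project's training code. It assumes the common form of "cosine with warmup" (linear warmup from zero to the peak learning rate, then cosine decay to zero), the standard LoRA scaling factor of alpha divided by rank, and a global batch size equal to GPUs times per-device batch times gradient accumulation steps. The function names, `TOTAL_STEPS`, and the per-device batch size of 4 are hypothetical values chosen for illustration; none of them appear in the training setup described here.

```python
import math


def cosine_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to 0."""
    # Assumption: warmup length is the warmup ratio times the total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def lora_scale(rank, alpha):
    """Multiplier applied to the low-rank update in standard LoRA (alpha / rank)."""
    return alpha / rank


def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    """Accumulation steps needed so the effective batch equals global_batch."""
    per_step = num_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by num_gpus * per_device_batch")
    return global_batch // per_step


if __name__ == "__main__":
    TOTAL_STEPS = 1000  # hypothetical; the real count depends on the dataset size
    for step in (0, 15, 30, 515, 1000):
        lr = cosine_with_warmup(step, TOTAL_STEPS, peak_lr=1e-3)
        print(f"step {step:4d}: lr = {lr:.2e}")
    # Rank 128 and alpha 256 are the finetune values from the table above.
    print("LoRA scale:", lora_scale(rank=128, alpha=256))
    # Per-device batch size of 4 is an illustrative assumption.
    print("accumulation steps:", grad_accum_steps(16, num_gpus=2, per_device_batch=4))
```

With a warmup ratio of 0.03, only the first 3% of steps are spent ramping up, so almost the whole epoch runs on the cosine decay.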

## Examples

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630a5ef0e81e1dea2cedcec0/Tot0eFOJF4UQbirJxLv7o.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630a5ef0e81e1dea2cedcec0/vveQQUPFPDcOj25lvfiwg.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630a5ef0e81e1dea2cedcec0/tcwilqHy6-cPiIPrI0NP0.png)

## Responsibility & Safety

We are committed to promoting an open approach to the development of Vietnamese AI, believing that it fosters better and faster innovation. This initiative is designed to bolster the efforts of the Vietnamese AI community.

The Vista model is built for versatility across a broad spectrum of applications. However, it is important to note that it is not tailored to meet every specific developer preference for all conceivable use cases out-of-the-box. Such preferences are inherently diverse and vary significantly across different applications.

## Ethical Considerations and Limitations

The responses from this model are not intended to offend or insult any individual or organization. The answers it provides should be treated as reference material only, and users should critically assess their accuracy.

The model still has significant limitations in its knowledge and in its performance on practical tasks.

## Future Work

We are committed to continuous improvement of the model, with specific plans to:

1. Further train the finetuned model on diverse Vision Language tasks to enhance its performance.
2. Improve the factual knowledge of the model, particularly to better adapt to Vietnamese cultural contexts.
3. Investigate the combination of different vision encoders to capture more comprehensive image features.

## Acknowledgement

We express our deep gratitude to various contributors and supporters of our project:

- **LLaVA**: Significant portions of the source code and instructions were adapted from the [LLaVA repository](https://github.com/haotian-liu/LLaVA), with modifications to fit our model architecture.

- **Vistral**: Many thanks to the Vistral development team for creating an outstanding LLM for Vietnamese, available at [Hugging Face - Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat).

- **SigLIP**: We are grateful for the innovative multilingual vision encoder developed by the SigLIP team, detailed in their [research paper](https://arxiv.org/abs/2303.15343).

- **Sponsors**: Special thanks to [VietAI](https://course.vietai.org/) and Google Cloud Research for their diamond-level sponsorship, which provided the computing resources essential to our project.

- **Mentors**: Our heartfelt appreciation goes to our mentors, Anh Duong Nguyen and Thanh Le, for their guidance and support.

## Citation Information

**BibTeX:**

```
@article{ViVLM_Vistral_Vision_2024,
  title={Vistral V},
  author={Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van and Tran, Oanh Ngoc},
  year={2024},
  month={June},
  url={https://huggingface.co/Vi-VLM/Vistral-V-7B}
}
```