---
library_name: transformers
license: mit
language:
- vi
- en
- zh
base_model:
- OpenGVLab/InternVL2_5-1B
pipeline_tag: image-text-to-text
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/AxRFDUt8uft6HVxBWuXgJ.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/DrUCZuXuMz47uVU4zqnJ4.png)

## Vintern-1B-v2 ❄️ (Viet-InternVL2-1B-v2) - The LLaVA 🌋 Challenger

We are excited to introduce **Vintern-1B-v2**, a Vietnamese 🇻🇳 multimodal model that combines the language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the visual model [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2] (CVPR 2024). The model excels at tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it is finetuned from the [Viet-InternVL2-1B](https://huggingface.co/5CD-AI/Viet-InternVL2-1B) model on over 3 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general VQA. Its small size makes it suitable for integration into on-device applications 📱.

[**\[🤗 HF Demo\]**](https://huggingface.co/spaces/khang119966/Vintern-v2-Demo)

Notably, the model can be finetuned on a T4 GPU on Google Colab by following the instructions in the "Finetune on your Data" section below.

## Model Details

| Model Name | Vision Part | Language Part |
| :---: | :---: | :---: |
| Vintern-1B-v2 | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |

Vintern-1B-v2 is an instruction-tuned multimodal large language model optimized for multimodal tasks. It consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

## Training details 📚

The fine-tuning data was sampled in part from the following datasets:

[Viet-OCR-VQA 📚](https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA), [Viet-Doc-VQA 📄](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA), [Viet-Doc-VQA-II 📑](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA-II), [Vista 🖼️](https://huggingface.co/datasets/Vi-VLM/Vista), [Viet-Receipt-VQA 🧾](https://huggingface.co/datasets/5CD-AI/Viet-Receipt-VQA), [Viet-Sketches-VQA ✏️](https://huggingface.co/datasets/5CD-AI/Viet-Sketches-VQA), [Viet-Geometry-VQA 📐](https://huggingface.co/datasets/5CD-AI/Viet-Geometry-VQA), [Viet-Wiki-Handwriting ✍️](https://huggingface.co/datasets/5CD-AI/Viet-Wiki-Handwriting), [Viet-ComputerScience-VQA 💻](https://huggingface.co/datasets/5CD-AI/Viet-ComputerScience-VQA), [Viet-Handwriting-gemini-VQA 🖋️](https://huggingface.co/datasets/5CD-AI/Viet-Handwriting-gemini-VQA), [Viet-Menu-gemini-VQA 🍽️](https://huggingface.co/datasets/5CD-AI/Viet-Menu-gemini-VQA), [Viet-Vintext-gemini-VQA 📜](https://huggingface.co/datasets/5CD-AI/Viet-Vintext-gemini-VQA), [Viet-OpenViVQA-gemini-VQA 🧠](https://huggingface.co/datasets/5CD-AI/Viet-OpenViVQA-gemini-VQA), [Viet-Resume-VQA 📃](https://huggingface.co/datasets/5CD-AI/Viet-Resume-VQA), [Viet-ViTextVQA-gemini-VQA 📑](https://huggingface.co/datasets/5CD-AI/Viet-ViTextVQA-gemini-VQA)

## Benchmarks 📈

## Examples

<div align="center">
<img src="ex_images/1.png" width="500"/>
</div>

```
```

<div align="center">
<img src="ex_images/4.jpg" width="500"/>
</div>

```
```

<div align="center">
<img src="ex_images/2.jpg" width="500"/>
</div>

```
```

<div align="center">
<img src="ex_images/3.png" width="400"/>
</div>

```
```

<div align="center">
<img src="ex_images/5.jpg" width="400"/>
</div>

```
```

<div align="center">
<img src="ex_images/6.png" width="400"/>
</div>

```
```

## Quickstart

The code snippet below shows how to load the tokenizer and model and how to generate a response.

You can also run inference by following the steps in our Colab inference notebook:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZD1oB56PF0lF66RCuTVJYLTEV0tM3CFf?usp=sharing)

```python
import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    """Convert to RGB, resize to a square tile, and normalize with ImageNet statistics."""
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Return the (columns, rows) grid whose aspect ratio best matches the image."""
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the larger grid only if the image is big enough to fill it
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """Split an image into image_size x image_size tiles on the best-fitting grid."""
    # calculate the existing image aspect ratio
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) grid with between min_num and max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    # optionally append a downscaled copy of the whole image as an extra tile
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    """Load an image file and return a stacked tensor of normalized tiles."""
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-1B-v2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-1B-v2", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

pixel_values = load_image(test_image, max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

# Prompt: "Describe the image in detail."
question = '<image>\nMô tả hình ảnh một cách chi tiết.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# Follow-up turn: pass the returned history back in ("Câu hỏi khác" = "Another question").
#question = "Câu hỏi khác ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
```
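
The tiling step in the snippet above decides how many 448×448 tiles an image is split into, which in turn affects how much of the 4096-token context the image may consume. The sketch below re-implements only the grid selection in plain Python, so the tile count can be checked for a given resolution without loading the model. The helper names (`candidate_grids`, `pick_grid`, `count_tiles`, `estimate_visual_tokens`) are hypothetical, and the figure of 256 visual tokens per tile is an assumption based on the InternVL2 design rather than something stated in this card.

```python
# Assumption: each 448x448 tile becomes 256 visual tokens (not stated in this card).
ASSUMED_TOKENS_PER_TILE = 256


def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # smaller grids are tried first, mirroring the sort in dynamic_preprocess
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio."""
    aspect_ratio = width / height
    area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # on a tie, move to the larger grid only if the image can fill it
            best = (cols, rows)
    return best


def count_tiles(width, height, max_num=12, use_thumbnail=True):
    """Number of tiles produced, including the thumbnail when more than one tile is used."""
    cols, rows = pick_grid(width, height, max_num=max_num)
    tiles = cols * rows
    if use_thumbnail and tiles != 1:
        tiles += 1
    return tiles


def estimate_visual_tokens(width, height, max_num=12):
    """Rough visual-token estimate under the assumed per-tile token count."""
    return count_tiles(width, height, max_num=max_num) * ASSUMED_TOKENS_PER_TILE


if __name__ == "__main__":
    for size in [(448, 448), (896, 896), (1344, 448), (1792, 1344)]:
        print(size, pick_grid(*size), count_tiles(*size), estimate_visual_tokens(*size))
```

Under these assumptions, a 1792×1344 image fills the full 4×3 grid plus a thumbnail (13 tiles, roughly 3328 visual tokens), which leaves limited room for text in a 4096-token context. Lowering `max_num` in `load_image` is one way to trade image detail for prompt and answer length.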

## Finetune on your Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bK6fpWfResjv9UxWoKHDStXQ8bop3a6Z?usp=sharing)

## Citation

```
@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese},
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480},
}
```