Model Card for 360VL

360VL is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on Llama3-70B [🤗Meta-Llama-3-70B-Instruct]. Beyond adopting the Llama 3 language model, 360VL introduces a globally aware multi-branch projector architecture, which gives the model stronger image understanding capabilities.

GitHub: https://github.com/360CVGroup/360VL

Model Zoo

The following versions of 360VL have been released.

Model Download
360VL-8B 🤗 Hugging Face
360VL-70B 🤗 Hugging Face

Features

360VL offers the following features:

  • Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

  • Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.

  • Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

  • Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672×672.
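To make the 672×672 figure concrete, here is a minimal sketch of fitting an arbitrary image size into a 672×672 square while preserving aspect ratio. The helper name `fit_within` and the scale-then-pad strategy are assumptions for illustration only; the model card does not say whether 360VL pads, crops, or stretches its input, and in practice the model's own `image_processor` does the real preprocessing.

```python
from typing import Tuple


def fit_within(width: int, height: int, target: int = 672) -> Tuple[int, int, int, int]:
    """Scale (width, height) to fit inside a target x target square.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y are
    the total padding needed on each axis to reach the full square.
    NOTE: hypothetical helper, not part of the 360VL code base.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so that the longer side becomes exactly `target` pixels.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, target - new_w, target - new_h


print(fit_within(1344, 672))  # (672, 336, 0, 336)
```

A 1344×672 image is halved to 672×336, leaving 336 pixels of vertical padding to fill the square.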

Performance

Model Checkpoints MMB-T MMB-D MMB-CN-T MMB-CN-D MMMU-V MMMU-T MME
QWen-VL-Chat 🤗LINK 61.8 60.6 56.3 56.7 37 32.9 1860
mPLUG-Owl2 🤖LINK 66.0 66.5 60.3 59.5 34.7 32.1 1786.4
CogVLM 🤗LINK 65.8 63.7 55.9 53.8 37.3 30.1 1736.6
Monkey-Chat 🤗LINK 72.4 71 67.5 65.8 40.7 - 1887.4
MM1-7B-Chat LINK - 72.3 - - 37.0 35.6 1858.2
IDEFICS2-8B 🤗LINK 75.7 75.3 68.6 67.3 43.0 37.7 1847.6
SVIT-v1.5-13B 🤗LINK 69.1 - 63.1 - 38.0 33.3 1889
LLaVA-v1.5-13B 🤗LINK 69.2 69.2 65 63.6 36.4 33.6 1826.7
LLaVA-v1.6-13B 🤗LINK 70 70.7 68.5 64.3 36.2 - 1901
Honeybee LINK 73.6 74.3 - - 36.2 - 1976.5
YI-VL-34B 🤗LINK 72.4 71.1 70.7 71.4 45.1 41.6 2050.2
360VL-8B 🤗LINK 75.3 73.7 71.1 68.6 39.7 37.1 1944.6
360VL-70B 🤗LINK 78.1 80.4 76.9 77.7 50.8 44.3 2012.3
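As a small worked example of reading the table, the sketch below computes how much 360VL-70B improves on 360VL-8B for each benchmark. The scores are copied from the two 360VL rows above; the column labels in the code are just shorthand for the seven score columns in table order, and the function name `score_deltas` is made up for this illustration.

```python
# The seven score columns, in the same order as the table.
COLUMNS = ["MMB-T", "MMB-D", "MMB-CN-T", "MMB-CN-D", "MMMU-V", "MMMU-T", "MME"]

# Scores copied from the 360VL rows of the table.
SCORES = {
    "360VL-8B": [75.3, 73.7, 71.1, 68.6, 39.7, 37.1, 1944.6],
    "360VL-70B": [78.1, 80.4, 76.9, 77.7, 50.8, 44.3, 2012.3],
}


def score_deltas(base: str, other: str) -> dict:
    """Return per-benchmark (other - base) differences, rounded to 0.1."""
    return {
        col: round(b - a, 1)
        for col, a, b in zip(COLUMNS, SCORES[base], SCORES[other])
    }


deltas = score_deltas("360VL-8B", "360VL-70B")
for col, delta in deltas.items():
    print(f"{col:>9}: {delta:+.1f}")
```

On these numbers the 70B model leads the 8B model on every listed benchmark, with the largest accuracy gap on MMMU-V.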

Quick Start 🤗

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-8B"

# Load the model and tokenizer; trust_remote_code is needed because the repo ships custom code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision encoder and move it to the GPU in half precision.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
# Reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Prepare one image and one question about it.
image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"
# Generation stops when the model emits the end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Build the prompt token ids and the preprocessed image tensor.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

# Greedy decoding: no sampling, a single beam.
output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Drop the prompt tokens and decode only the newly generated part.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
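The generation settings above (`do_sample=False`, `num_beams=1`, `eos_token_id=terminators`, `max_new_tokens`) amount to greedy decoding that stops at a terminator token, after which the prompt tokens are sliced off. The toy loop below illustrates that control flow with plain Python lists. It is a simplified sketch, not the `transformers` implementation: `greedy_generate`, `toy_model`, and the token id `99` standing in for `<|eot_id|>` are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    prompt_ids: Sequence[int],
    next_token: Callable[[List[int]], int],
    terminators: Sequence[int],
    max_new_tokens: int = 512,
) -> List[int]:
    """Toy greedy loop: returns the prompt ids followed by the new ids."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)   # stand-in for picking the most likely token
        ids.append(token)
        if token in terminators:  # stop once a terminator is emitted
            break
    return ids


EOT = 99  # hypothetical id standing in for "<|eot_id|>"


def toy_model(ids: List[int]) -> int:
    """Hypothetical 'model': counts upward, then emits EOT at length 6."""
    return ids[-1] + 1 if len(ids) < 6 else EOT


prompt = [1, 2, 3]
output = greedy_generate(prompt, toy_model, terminators=[EOT])
new_tokens = output[len(prompt):]  # same idea as output_ids[:, input_token_len:]
print(new_tokens)  # [4, 5, 6, 99]
```

In the Quick Start code, decoding with `skip_special_tokens=True` then removes the trailing terminator from the printed text.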

Model type: 360VL-8B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

Base LLM: meta-llama/Meta-Llama-3-8B-Instruct

Model date: 360VL-8B was trained in April 2024.

License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Where to send questions or comments about the model: https://github.com/360CVGroup/360VL

Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
