Commit 1878da4 by xiechunyu (merge of parents e982bb6 and fc7b70e)

Files changed (1): README.md (+124 -124)
---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
<br>
<br>

# Model Card for 360VL
<p align="center">
 <img src="https://github.com/360CVGroup/360VL/blob/master/qh360_vl/360vl.PNG?raw=true" width=100%/>
</p>

**360VL** is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on **Llama3-70B** [[🤗Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)]. Beyond adopting Llama 3 as its language model, 360VL also introduces a globally aware multi-branch projector architecture, which gives the model stronger image understanding capabilities.
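
The card does not spell out the projector design, so the following is only a minimal sketch of what a globally aware multi-branch projector could look like: one branch projects every visual patch token into the LLM embedding space, while a second branch pools the patches into a single global descriptor that is prepended as an extra token. All names and dimensions here are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only -- NOT the released 360VL projector. It merely shows the idea
# of combining a per-patch (local) branch with a pooled (global) branch.
import torch
import torch.nn as nn

class GloballyAwareMultiBranchProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 8192):
        super().__init__()
        # Local branch: map each visual patch token into the LLM embedding space.
        self.local_proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Global branch: summarize the whole image into one token.
        self.global_proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from the vision tower
        local_tokens = self.local_proj(patch_tokens)                # (B, N, llm_dim)
        global_token = self.global_proj(patch_tokens.mean(dim=1))   # (B, llm_dim)
        # Prepend the global summary token so the LLM attends to a whole-image
        # descriptor alongside the fine-grained patch tokens.
        return torch.cat([global_token.unsqueeze(1), local_tokens], dim=1)
```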

## Model Zoo

360VL has released the following versions.

| Model | Download |
|---|---|
| 360VL-8B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-8B) |
| 360VL-70B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-70B) |
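
Either repository can be fetched with the standard `huggingface_hub` API; the short sketch below (not part of the original card) downloads a snapshot whose local path can then be used as the `checkpoint` in the Quick Start section.

```python
# Download a local copy of a 360VL checkpoint with the standard huggingface_hub API.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="qihoo360/360VL-8B")  # or "qihoo360/360VL-70B"
print(local_dir)  # pass this path as `checkpoint` in the Quick Start snippet
```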

## Features

360VL offers the following features:

- Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

- Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.

- Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

- Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672&times;672.

## Performance

| Model | Checkpoints | MMB<sub>T</sub> | MMB<sub>D</sub> | MMB-CN<sub>T</sub> | MMB-CN<sub>D</sub> | MMMU<sub>V</sub> | MMMU<sub>T</sub> | MME |
|:--------------------|:------------:|:----:|:------:|:------:|:-------:|:-------:|:-------:|:-------:|
| Qwen-VL-Chat | [🤗LINK](https://huggingface.co/Qwen/Qwen-VL-Chat) | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | [🤖LINK](https://www.modelscope.cn/models/iic/mPLUG-Owl2/summary) | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | [🤗LINK](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | [🤗LINK](https://huggingface.co/echo840/Monkey-Chat) | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | [LINK](https://ar5iv.labs.arxiv.org/html/2403.09611) | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | [🤗LINK](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| Honeybee | [LINK](https://github.com/kakaobrain/honeybee) | 74.3 | 74.3 | - | - | 36.2 | - | 1950 |
| SVIT-v1.5-13B | [🤗LINK](https://huggingface.co/Isaachhe/svit-v1.5-13b-full) | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Yi-VL-34B | [🤗LINK](https://huggingface.co/01-ai/Yi-VL-34B) | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| **360VL-8B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-8B) | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1899.1 |
| **360VL-70B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-70B) | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 1983.2 |

In the header, <sub>T</sub> and <sub>D</sub> denote the MMBench test and dev splits, and the MMMU <sub>V</sub> and <sub>T</sub> columns denote its val and test splits.

## Quick Start 🤗

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qh360_vl-70B"  # local path to the checkpoint (e.g. a download of qihoo360/360VL-70B)

# Load the model and tokenizer; trust_remote_code is required for the custom 360VL code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and its image processor.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"

# Stop generation at the Llama 3 end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Build the multimodal prompt (text + image) expected by the model.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, dropping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
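
The snippet above loads the whole model onto a single GPU, which is fine for 360VL-8B but usually exceeds a single device's memory for the 70B checkpoint in float16 (roughly 140 GB of weights). The sketch below shows two standard `transformers` alternatives: sharding across all visible GPUs, or 4-bit quantization with bitsandbytes. These are generic options, not settings prescribed by the model card, so they may need adjustment for the custom 360VL code.

```python
# Memory-saving loading options (illustrative; not from the original model card).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

checkpoint = "qh360_vl-70B"  # same local path / Hub ID as in the Quick Start

# Option 1: shard the float16 weights across every visible GPU.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
).eval()

# Option 2: 4-bit quantization (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
).eval()
```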

**Model type:**
360VL-70B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture.
Base LLM: [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)

**Model date:**
360VL-70B was trained in May 2024.

## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the Apache License 2.0.

**Where to send questions or comments about the model:**
https://github.com/360CVGroup/360VL

## Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://github.com/kakaobrain/honeybee)