bluezora committed
Commit e8e6a82 • 1 Parent(s): 1bda9d9

Update README.md

Files changed (1)
  1. README.md +118 -3
README.md CHANGED
@@ -1,3 +1,118 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ja
+ - en
+ pipeline_tag: image-to-text
+ ---
+
+ ## Model Description
+
+ **llava-calm2-siglip** is an experimental Vision Language Model that can answer questions in Japanese about images.
+
+ ## Usage
+
+ <details>
+
+ ```python
+ from PIL import Image
+ import requests
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
+ import torch
+
+ model = LlavaForConditionalGeneration.from_pretrained(
+     "cyberagent/llava-calm2-siglip",
+     torch_dtype=torch.bfloat16,
+ ).to(0)
+
+ processor = AutoProcessor.from_pretrained("cyberagent/llava-calm2-siglip")
+
+ prompt = """USER: <image>
+ この画像を説明してください。
+ ASSISTANT: """  # the Japanese line means "Please describe this image."
+
+ url = "https://unsplash.com/photos/LipkIP4fXbM/download?force=true&w=640"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.bfloat16)
+ generate_ids = model.generate(**inputs,
+     max_length=500,
+     do_sample=True,
+     temperature=0.2,
+ )
+ output = processor.tokenizer.decode(generate_ids[0][:-1], clean_up_tokenization_spaces=False)
+
+ print(output)
+ # Example output (the answer, in English: "The image shows three takoyaki, cooked in a takoyaki maker, placed on a wooden table. Takoyaki is made by cooking a wheat-flour-based batter into balls with fillings such as octopus, tempura scraps, and red pickled ginger. It is often eaten topped with sauce, mayonnaise, green laver, and bonito flakes."):
+ # USER: <image>
+ # この画像を説明してください。
+ # ASSISTANT: 画像には、木製のテーブルの上に置かれた、たこ焼き器で焼かれた3つのたこ焼きが映っています。たこ焼きは、小麦粉をベースにした生地を丸く焼き、中にタコや天かす、紅ショウガなどの具材を入れたものです。たこ焼きは、ソース、マヨネーズ、青海苔、かつおぶしをかけて食べることが多いです。
+ ```
+
+ </details>
+
+ ## Chat Template
+ ```
+ USER: <image>
+ {user_message1}
+ ASSISTANT: {assistant_message1}<|endoftext|>
+ USER: {user_message2}
+ ASSISTANT: {assistant_message2}<|endoftext|>
+ USER: {user_message3}
+ ASSISTANT: {assistant_message3}<|endoftext|>
+ ```
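The chat template can be assembled programmatically for multi-turn conversations. The following is a minimal sketch, not part of the committed README: `build_prompt`, `IMAGE_TOKEN`, and `EOS_TOKEN` are hypothetical names introduced here, and joining turns with a single newline is an assumption read off the template's layout.

```python
IMAGE_TOKEN = "<image>"
EOS_TOKEN = "<|endoftext|>"


def build_prompt(turns):
    """Render (user, assistant) pairs in the README's chat template.

    `turns` is a list of (user_message, assistant_message) tuples. Pass
    None as the final assistant message to leave the prompt open for
    generation, i.e. ending in "ASSISTANT: ".
    """
    lines = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0:
            # Only the first user turn carries the image placeholder,
            # on its own line before the message.
            lines.append(f"USER: {IMAGE_TOKEN}")
            lines.append(user)
        else:
            lines.append(f"USER: {user}")
        if assistant is None:
            if index != len(turns) - 1:
                raise ValueError("only the last assistant message may be None")
            lines.append("ASSISTANT: ")
        else:
            # Completed assistant turns are closed with the end-of-text token.
            lines.append(f"ASSISTANT: {assistant}{EOS_TOKEN}")
    return "\n".join(lines)


# Hypothetical two-turn conversation; the second answer is left open.
history = [
    ("What is in this picture?", "Three takoyaki on a wooden table."),
    ("What are they made of?", None),
]
print(build_prompt(history))
```

With a single open turn, the function reproduces the shape of the prompt in the Usage example: `USER: <image>`, the question on the next line, then `ASSISTANT: ` with a trailing space.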
+
+ ## Model Details
+
+ * **Model size**: 7B
+ * **Model type**: Transformer-based Vision Language Model
+ * **Language(s)**: Japanese, English
+ * **Developed by**: [CyberAgent, Inc.](https://www.cyberagent.co.jp/)
+ * **License**: Apache-2.0
+
+ ## Training
+
+ This model is a vision-language instruction-following model based on [LLaVA 1.5](https://arxiv.org/abs/2310.03744). It uses [cyberagent/calm2-7b-chat](https://huggingface.co/cyberagent/calm2-7b-chat) as its language model and [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) as its image encoder. Training has two stages: the first stage trains the MLP projection from scratch, and the second stage further trains both the language model and the MLP projection.
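The two-stage schedule can be summarized as a table of which components receive gradient updates in each stage. This is an illustrative sketch only, not the authors' training code: the stage and component names are hypothetical, and keeping the image encoder frozen in both stages is an assumption carried over from the LLaVA 1.5 recipe, since the README does not say so explicitly.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the model.
COMPONENTS = ("image_encoder", "mlp_projection", "language_model")


@dataclass(frozen=True)
class StagePlan:
    name: str
    trainable: frozenset  # components updated during this stage


STAGES = (
    # Stage 1: only the MLP projection is learned, from scratch.
    StagePlan("stage1_projection", frozenset({"mlp_projection"})),
    # Stage 2: the language model and the MLP projection are both trained.
    StagePlan("stage2_instruction_tuning",
              frozenset({"mlp_projection", "language_model"})),
)


def requires_grad(stage, component):
    """Return True if `component` is updated in `stage`.

    The image encoder never appears in `trainable`; that it stays frozen
    is an assumption, not something the README states.
    """
    if component not in COMPONENTS:
        raise KeyError(component)
    return component in stage.trainable


for stage in STAGES:
    flags = {c: requires_grad(stage, c) for c in COMPONENTS}
    print(stage.name, flags)
```

In a PyTorch training loop, flags like these would typically be applied by setting `requires_grad` on each component's parameters before building the optimizer.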
+
+ ## Dataset for Visual Instruction Tuning
+ In the second stage, Visual Instruction Tuning, we train on a dataset of conversations about images. This conversational data is generated with our in-house large-scale Japanese language model, based on images, captions, object labels, and bounding boxes from [MS-COCO](https://cocodataset.org/#home) and [VisualGenome](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html). For methods of generating conversational datasets for Visual Instruction Tuning without using images, please refer to [LLaVA 1.5](https://arxiv.org/abs/2310.03744).
+
+ ## Evaluation Results
+
+ ### LLaVA Bench In-the-wild
+ | Model | Detail | Conv | Complex | Average |
+ | - | -: | -: | -: | -: |
+ | [llava-calm2-siglip](https://huggingface.co/cyberagent/llava-calm2-siglip) | **51.2** | 55.9 | **65.51** | **57.54** |
+ | [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 26.02 | 24.84 | 29.18 | 26.68 |
+ | [SakanaAI EvoVLM-JP](https://huggingface.co/SakanaAI/EvoVLM-JP-v1-7B) | 49.59 | **65.49** | 54.22 | 56.43 |
+ | [Heron BLIP v1 (620k)](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1-llava-620k) | 45.45 | 32.90 | 56.89 | 45.08 |
+ | [Heron GIT](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1) | 40.98 | 39.87 | 54.59 | 45.15 |
+ - [LLaVA Bench In-the-wild](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) translated into Japanese.
+
+ ### Heron-Bench
+ | Model | Detail | Conv | Complex | Average |
+ | - | -: | -: | -: | -: |
+ | [llava-calm2-siglip](https://huggingface.co/cyberagent/llava-calm2-siglip) | **53.42** | 50.13 | **52.72** | **52.09** |
+ | [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 25.15 | 51.23 | 37.84 | 38.07 |
+ | [SakanaAI EvoVLM-JP](https://huggingface.co/SakanaAI/EvoVLM-JP-v1-7B) | 50.31 | 44.42 | 40.47 | 45.07 |
+ | [Heron BLIP v1 (620k)](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1-llava-620k) | 49.09 | 41.51 | 45.72 | 45.44 |
+ | [Heron GIT](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1) | 42.77 | **54.20** | 43.53 | 46.83 |
+ - [Heron-Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
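The README does not say how the Average column is computed. The numbers are consistent with an unweighted mean of Detail, Conv, and Complex, rounded to two decimals; the short check below tests that reading against every row of both tables. The helper name `average_is_mean` and the rounding tolerance are choices made for this sketch.

```python
# Rows copied from the two tables: (Detail, Conv, Complex, Average).
SCORES = {
    "LLaVA Bench In-the-wild": {
        "llava-calm2-siglip": (51.2, 55.9, 65.51, 57.54),
        "Japanese Stable VLM": (26.02, 24.84, 29.18, 26.68),
        "SakanaAI EvoVLM-JP": (49.59, 65.49, 54.22, 56.43),
        "Heron BLIP v1 (620k)": (45.45, 32.90, 56.89, 45.08),
        "Heron GIT": (40.98, 39.87, 54.59, 45.15),
    },
    "Heron-Bench": {
        "llava-calm2-siglip": (53.42, 50.13, 52.72, 52.09),
        "Japanese Stable VLM": (25.15, 51.23, 37.84, 38.07),
        "SakanaAI EvoVLM-JP": (50.31, 44.42, 40.47, 45.07),
        "Heron BLIP v1 (620k)": (49.09, 41.51, 45.72, 45.44),
        "Heron GIT": (42.77, 54.20, 43.53, 46.83),
    },
}


def average_is_mean(row, tolerance=0.006):
    """True if the reported average matches the mean of the three
    category scores to within two-decimal rounding."""
    detail, conv, complex_score, reported = row
    mean = (detail + conv + complex_score) / 3
    return abs(mean - reported) < tolerance


for bench, rows in SCORES.items():
    for model, row in rows.items():
        print(f"{bench} | {model}: consistent={average_is_mean(row)}")
```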
+
+ ## Use and Limitations
+
+ ### Intended Use
+
+ This model is designed for use by the open-source community in vision-language applications and academic research.
+
+ ### Limitations and Biases
+
+ As a general-purpose Japanese VLM, this model performs best when fine-tuned on data relevant to each task.
+ Commercial use is technically possible but should be approached with caution, and implementing mechanisms to filter out inappropriate content is strongly recommended when the model is deployed in production systems.
+ Using this model is not advised in applications that could harm individuals or groups or cause distress.
+ CyberAgent expressly disclaims any liability for direct, indirect, special, incidental, or consequential damages, as well as for any losses that may result from using this model, regardless of the outcomes.
+ Users must fully understand these limitations before employing the model.
+
+ ## Author
+
+ [Aozora Inagaki](https://huggingface.co/bluezora)