---
license: cc-by-4.0
datasets:
- alfredplpl/commoncatalog-cc-by-ext
- turing-motors/LLaVA-Pretrain-JA
language:
- ja
pipeline_tag: image-to-text
---

# LLaVA-JP Model Card

## Model detail

**Model type:**

LLaVA-JP is a vision-language model that can converse about input images.<br>
It is an LVLM trained with [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) as the image encoder and [llm-jp/llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) as the text decoder, and it supports 768 x 768 high-resolution image input via the scaling_on_scales method.
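
The scaling_on_scales (S2) idea, very roughly: the image is encoded at several resolutions, the higher-resolution feature maps are pooled back to the base patch grid, and the per-scale features are concatenated along the channel dimension. The sketch below only illustrates that idea with a placeholder encoder and placeholder scale values; the real S2 wrapper additionally splits the larger images into base-sized crops before encoding, and this repository's implementation may differ.

```python
import torch
import torch.nn.functional as F


def multiscale_features(encode, image, scales=(1, 2)):
    """Encode `image` at several scales and concatenate per-patch features.

    encode: callable mapping (B, 3, H, W) -> (B, N, D) patch features
            (a stand-in for the SigLIP tower; N must be a square number).
    image:  (B, 3, H, W) batch at the base resolution (e.g. 384 x 384).
    scales: resolution multipliers; (1, 2) corresponds to 384- and 768-pixel inputs.
    """
    b, _, h, w = image.shape
    outputs, base_grid = [], None
    for s in scales:
        x = F.interpolate(image, size=(h * s, w * s), mode="bilinear", align_corners=False)
        feats = encode(x)                                  # (B, N_s, D)
        grid = int(feats.shape[1] ** 0.5)                  # patches per side at this scale
        if base_grid is None:
            base_grid = grid                               # grid size of the base scale
        feats = feats.transpose(1, 2).reshape(b, -1, grid, grid)
        feats = F.adaptive_avg_pool2d(feats, base_grid)    # pool back to the base grid
        outputs.append(feats.flatten(2).transpose(1, 2))   # (B, N_base, D)
    return torch.cat(outputs, dim=-1)                      # (B, N_base, D * len(scales))


if __name__ == "__main__":
    # Dummy encoder (16-pixel patches, 8-dim features) just to check the shapes.
    def dummy_encode(x):
        b, _, h, w = x.shape
        return torch.randn(b, (h // 16) * (w // 16), 8)

    img = torch.randn(1, 3, 384, 384)
    print(multiscale_features(dummy_encode, img).shape)  # torch.Size([1, 576, 16])
```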

**Training:**

In the first stage, the Vision Projector was pretrained on LLaVA-Pretrain-JA.<br>
In the second stage, the model was fine-tuned on 10.5k samples from commoncatalog-cc-by-ext.

Resources for more information: https://github.com/tosiyuki/LLaVA-JP/tree/main

## How to use the model
**1. Download dependencies**
```
git clone https://github.com/tosiyuki/LLaVA-JP.git
```
The inference script below imports the `llava` package from this repository, so run it from inside the cloned directory.

**2. Inference**
```python
import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_gpt2 import LlavaGpt2ForCausalLM
from llava.train.dataset import tokenizer_image_token


if __name__ == "__main__":
    model_path = 'toshi456/llava-jp-1.3b-v1.1-commoncatalog-cc-by-ext-10k'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.bfloat16 if device == "cuda" else torch.float32

    model = LlavaGpt2ForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
        device_map=device,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1532,
        padding_side="right",
        use_fast=False,
    )
    model.eval()

    conv_mode = "v1"
    conv = conv_templates[conv_mode].copy()

    # image pre-processing: with scaling_on_scales enabled, the input resolution is
    # the base size times the number of scales (e.g. 384 * 2 = 768)
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

    image_size = model.get_model().vision_tower.image_processor.size["height"]
    if model.get_model().vision_tower.scales is not None:
        image_size = model.get_model().vision_tower.image_processor.size["height"] * len(model.get_model().vision_tower.scales)

    if device == "cuda":
        image_tensor = model.get_model().vision_tower.image_processor(
            image,
            return_tensors='pt',
            size={"height": image_size, "width": image_size}
        )['pixel_values'].to(torch_dtype).cuda()
    else:
        image_tensor = model.get_model().vision_tower.image_processor(
            image,
            return_tensors='pt',
            size={"height": image_size, "width": image_size}
        )['pixel_values'].to(torch_dtype)

    # create prompt
    # ユーザー: <image>\n{prompt}
    prompt = "画像について説明してください。"  # "Please describe the image."
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt,
        tokenizer,
        IMAGE_TOKEN_INDEX,
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    input_ids = input_ids[:, :-1]  # drop the trailing </sep> that is appended to the input
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # predict
    with torch.inference_mode():
        output_id = model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=False,
            temperature=1.0,
            top_p=1.0,
            max_new_tokens=256,
            streamer=streamer,
            use_cache=True,
        )

    # Example output (the model answers in Japanese):
    """画像には、木製の表面に座っている猫が描かれています。猫は、ラップトップの画面に集中しています。ラップトップは、黒い金属フレームと白いキーボードを持つ、鮮やかなオレンジ色です。猫の目は閉じており、リラックスした状態を示唆しています。背景は、猫のラップトップとその周囲の詳細を強調する灰色のテクスチャーです。画像にはテキストや他のオブジェクトは含まれていません。猫とラップトップの相対的な位置関係は、猫がラップトップの画面に集中していることを示唆しています。画像には他のオブジェクトや行動は含まれていません。<EOD|LLM-jp>"""
```
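
If you want the generated text as a Python string rather than streamed output, you can decode the returned ids. This is a small sketch under the assumption that, as in the upstream LLaVA code, `generate` returns only the newly generated token ids; if your version also returns the prompt tokens, slice them off before decoding.

```python
# Decode the ids returned by generate() into a string (variables from the script above).
# Assumption: output_id contains only newly generated tokens.
text = tokenizer.decode(output_id[0], skip_special_tokens=True)
if text.endswith(stop_str):  # strip the conversation stop string if it survived decoding
    text = text[: -len(stop_str)].strip()
print(text)
```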

## Training dataset
**Stage1 Pretrain**
- [LLaVA-Pretrain-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA)

**Stage2 Fine-tuning**
- [commoncatalog-cc-by-ext](https://huggingface.co/datasets/alfredplpl/commoncatalog-cc-by-ext)

## Acknowledgement
- [LLaVA](https://llava-vl.github.io/)
- [LLM-jp](https://llm-jp.nii.ac.jp/)
- [scaling_on_scales](https://github.com/bfshi/scaling_on_scales/tree/master)

## License
Apache License 2.0