---
base_model:
- tokyotech-llm/Llama-3-Swallow-8B-v0.1
- meta-llama/Llama-3.2-11B-Vision-Instruct
- meta-llama/Meta-Llama-3-8B
license: llama3.2
tags:
- merge
---

## Model Information

This is the initial version of [Kendamarron/Llama-3.2-11B-Vision-Instruct-Swallow-8B-Merge](https://huggingface.co/Kendamarron/Llama-3.2-11B-Vision-Instruct-Swallow-8B-Merge).

It uses the Llama-3 series instead of the Llama-3.1 series.

Subjectively, its output does not differ much from the model built with Llama-3.1.

### Details

https://zenn.dev/kendama/articles/280a4089cb8a72

## Recipe
```
Llama-3.2-11B-Vision-Instruct + (Llama-3-Swallow-8B-v0.1 - Meta-Llama-3-8B)
```
- Vision Model: [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- Base Text Model: [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
- Japanese Text Model: [tokyotech-llm/Llama-3-Swallow-8B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-8B-v0.1)
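
As a rough illustration of this recipe, the sketch below adds the difference between the Swallow and base Llama-3 weights onto the matching text-side parameters of the vision model. The `map_to_vision_key` helper, the `language_model.` key prefix, and the simple name/shape matching (which ignores the interleaved cross-attention layers and layer-index offsets in the 11B language tower) are assumptions for illustration only; see the article linked above for the actual procedure.

```python
import torch
from transformers import AutoModelForCausalLM, MllamaForConditionalGeneration

# Load the three ingredients of the recipe (bfloat16; CPU is fine for merging).
vision = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", torch_dtype=torch.bfloat16
)
swallow = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Llama-3-Swallow-8B-v0.1", torch_dtype=torch.bfloat16
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

vision_sd = vision.state_dict()
swallow_sd = swallow.state_dict()
base_sd = base.state_dict()


def map_to_vision_key(text_key: str) -> str:
    # Hypothetical mapping: assume the 8B text-model parameter names reappear
    # in the vision model's language tower under a "language_model." prefix.
    # The real merge must also remap layer indices around the extra
    # cross-attention layers; that step is omitted here.
    return "language_model." + text_key


# Add the "Japanese vector" (Swallow - Meta-Llama-3-8B) onto every text-side
# parameter of the vision model whose name and shape line up.
for key, swallow_param in swallow_sd.items():
    if key not in base_sd:
        continue
    target = map_to_vision_key(key)
    if target in vision_sd and vision_sd[target].shape == swallow_param.shape:
        vision_sd[target] += swallow_param - base_sd[key]

vision.load_state_dict(vision_sd)
vision.save_pretrained("Llama-3.2-11B-Vision-Instruct-Swallow-8B-Merge-v0.1")
```

The shape check skips parameters whose dimensions differ between the 8B text model and the vision model (for example the token embeddings, which include extra image tokens); how those are handled in the released weights follows the linked article, not this sketch.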

## License

[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)

## How to use

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Kendamarron/Llama-3.2-11B-Vision-Instruct-Swallow-8B-Merge-v0.1"

# Load the merged model in bfloat16 and shard it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The prompt (in Japanese) asks the model to compose a haiku about the image.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "この画像で一句詠んでください。"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```