---
language:
- ko
- en
license: cc-by-nc-sa-4.0
library_name: transformers
---

# Llama-3-KoEn-8B-xtuner-llava-preview πŸŒ‹

<!-- Provide a quick summary of what the model is/does. -->

Llama-3-KoEn-8B-xtuner-llava-preview πŸŒ‹ is a Korean multimodal model built on the LLaVA architecture, created by merging two models with the [Chat Vector](https://arxiv.org/abs/2310.04799) method:
1) [beomi/Llama-3-KoEn-8B-preview](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview)
2) [xtuner/llava-llama-3-8b-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-transformers)
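
The merge script itself is not included in this card. Below is a minimal sketch of a Chat Vector style merge under stated assumptions: the xtuner LLaVA language tower shares Llama-3's parameter names, `meta-llama/Meta-Llama-3-8B-Instruct` is assumed as the reference base for the difference vector, and any shape-mismatched parameters (e.g., resized embeddings) are simply skipped. This illustrates the method; it is not the exact recipe behind the published revisions.

```python
import torch
from transformers import AutoModelForCausalLM, LlavaForConditionalGeneration

# Donor LLaVA model (vision tower + projector + Llama-3 language tower).
llava = LlavaForConditionalGeneration.from_pretrained(
    "xtuner/llava-llama-3-8b-transformers", torch_dtype=torch.bfloat16
)
# Korean continual-pretrained language model.
koen = AutoModelForCausalLM.from_pretrained(
    "beomi/Llama-3-KoEn-8B-preview", torch_dtype=torch.bfloat16
)
# Reference base used to form the difference vector (assumption: the Instruct
# checkpoint the xtuner LLaVA language tower was trained from).
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)

koen_sd = koen.state_dict()
base_sd = base.state_dict()

# Add the "Korean vector" (KoEn minus base) to each matching parameter of the
# LLaVA language tower; the vision tower and projector are left untouched.
with torch.no_grad():
    for name, param in llava.language_model.state_dict().items():
        if name in koen_sd and koen_sd[name].shape == param.shape:
            param.add_(koen_sd[name] - base_sd[name])

llava.save_pretrained("./Llama-3-KoEn-8B-xtuner-llava-merged")
```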

## Model Details

### Model Description

- **Developed by:** Junbum Lee (Beomi)
- **Model type:** HuggingFace Llava πŸŒ‹
- **Language(s) (NLP):** Korean, English
- **License:** cc-by-nc-sa-4.0, subject to the Llama 3 license
- **Merged from models:** [beomi/Llama-3-KoEn-8B-preview](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview) & [xtuner/llava-llama-3-8b-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-transformers)

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

![Cat walking on frozen Han-River, Seoul](https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/NWfoArWI4UPAxpEnolkwT.jpeg)

> Two merge approaches are available across the following revisions:
>
> v1. `revision='a38aac3'`: Basic Chat Vector, with the [25B+ trained KoEn ckpt (rev. d4d25a2)](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview/commit/d4d25a2).
>
> v1-1. `revision='0224971'`: Basic Chat Vector, with the [40B+ trained KoEn ckpt (rev. ad39b32)](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview/commit/ad39b32cd4207f37f61f16e79d3f4020c5b744ef).
>
> v1-2. `revision='170746c'`: Basic Chat Vector, with the [80B+ trained KoEn ckpt (rev. b4c45ab)](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview/commit/b4c45ab3355c6ccb9bb1ecdf8a75ded4d6620c7e).
>
> v2. `revision='4f04d1e'`: Model-diff based merging (ref. https://huggingface.co/blog/maywell/llm-feature-transfer), with the [25B+ trained KoEn ckpt (rev. d4d25a2)](https://huggingface.co/beomi/Llama-3-KoEn-8B-preview/commit/d4d25a2).

```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "beomi/Llama-3-KoEn-8B-xtuner-llava-preview"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype='auto', 
    device_map='auto',
    revision='a38aac3', # 'a38aac3' for basic Chat Vector, '4f04d1e' for model-diff based merging (ref. https://huggingface.co/blog/maywell/llm-feature-transfer)
)

processor = AutoProcessor.from_pretrained(model_id)

tokenizer = processor.tokenizer
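# Stop generation at either the regular EOS token or Llama-3's <|eot_id|> turn terminator.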
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

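# Llama-3 chat format written out by hand; the assistant turn is pre-filled with
# '이 μ΄λ―Έμ§€μ—λŠ”' ("In this image, ...") to steer a Korean response.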
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\n이 이미지에 λŒ€ν•΄μ„œ μ„€λͺ…ν•΄μ£Όμ„Έμš”.<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n이 μ΄λ―Έμ§€μ—λŠ”")
image_file = "https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/NWfoArWI4UPAxpEnolkwT.jpeg"

raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=400, do_sample=True, eos_token_id=terminators,)
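# [2:] skips the leading <|begin_of_text|> and <|start_header_id|> tokens,
# so the printed transcript starts at "user" as in the examples below.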
print(processor.decode(output[0][2:], skip_special_tokens=False))

# --- Example Output [v1, Chat Vector] ---
user<|end_header_id|>

<image>
이 이미지에 λŒ€ν•΄μ„œ μ„€λͺ…ν•΄μ£Όμ„Έμš”.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

이 μ΄λ―Έμ§€μ—λŠ” 고양이 ν•œ λ§ˆλ¦¬κ°€ κ°•λ¬Ό μœ„λ₯Ό κ±Έμ–΄κ°€λŠ” λͺ¨μŠ΅μ΄ λ³΄μ—¬μ§‘λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” κ°•λ¬Όμ˜ μž”λ¬Όκ²°μ— λ―Έλ„λŸΌμ„ 타고 κ°• κ°€λ‘œλ₯Ό μ§€λ‚˜λŠ” 데 λŠ₯μˆ™ν•˜κ²Œ λ³΄μž…λ‹ˆλ‹€. κ³ μ–‘μ΄μ˜ λ°œμ€ κ°•λ¬Όλ‘œ 잘 λ“€μ–΄κ°€, 그것을 즐기며 κ±Έμ–΄κ°‘λ‹ˆλ‹€. 

λ˜ν•œ 이 이미지도 μŒμ„± λ…ΉμŒμ„ ν•˜κ±°λ‚˜ λ…Ήν™”λœ 자료둜 μ œμž‘λ˜μ—ˆμœΌλ©°, 주둜 κ³ μ–‘μ΄μ˜ λͺ¨μŠ΅μ„ κ°•ν•˜κ²Œ λ³΄μ—¬μ€λ‹ˆλ‹€. μ†Œλ¦¬ νš¨κ³Όλ„ μ—¬λŸ¬ κ°€μ§€λ‘œ μΆ”κ°€ν•˜μ—¬ κ³ μ–‘μ΄μ˜ μŠ€ν† λ¦¬λ₯Ό λ‹€μ–‘ν•˜κ²Œ μ „λ‹¬ν•©λ‹ˆλ‹€. 강물은 μž”λ¬Όκ²°μ„ λ‚˜νƒ€λ‚΄λ©° κ°•λ¬Ό μœ„λ₯Ό κ±·λŠ” κ³ μ–‘μ΄μ˜ λͺ¨μŠ΅μ„ λ”μš± κ°•λ ¬ν•˜κ²Œ κ°•μ‘°ν•˜κΈ° μœ„ν•΄ μž”λ¬Όκ²°μ„ 톡해 더 λ””ν…ŒμΌν•œ μž₯면을 λ³΄μ—¬μ€λ‹ˆλ‹€.<|eot_id|>

# --- Example Output [v1-1, Chat Vector] ---
user<|end_header_id|>

<image>
이 이미지에 λŒ€ν•΄μ„œ μ„€λͺ…ν•΄μ£Όμ„Έμš”.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

이 μ΄λ―Έμ§€μ—μ„œλŠ” ν•œ 고양이가 μ„œν•΄μ•ˆμ— μœ„μΉ˜ν•œ λ°”λ‹€λ₯Ό κ±·κ³  μžˆλŠ” λͺ¨μŠ΅μ„ λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” ν•΄λ³€μ—μ„œλΆ€ν„° λ°”λ‹€λ‘œ κ±Έμ–΄λ“€μ–΄κ°€λŠ” 쀑이며, μ£Όλ³€μ—λŠ” μž”μž”ν•œ νŒŒλ„κ°€ λ°€λ €μ˜€λŠ” λͺ¨μŠ΅μ„ 보여주고 μžˆμŠ΅λ‹ˆλ‹€. 이 κ³ μ–‘μ΄λŠ” νƒœμ–΄λ‚  λ•ŒλΆ€ν„° 고양이와 κ°•μ•„μ§€μ™€λŠ” λ‹€λ₯΄κ²Œ λ°”λ‹€λ₯Ό κ²½ν—˜ν•˜κ³ , 적응해가고 μžˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ°”λ‹€λ₯Ό μ’‹μ•„ν•˜κ³ , 이 ν™˜κ²½μ—μ„œ 행볡을 λŠλΌλŠ” 것 κ°™μŠ΅λ‹ˆλ‹€. 이 κ³ μ–‘μ΄λŠ” 인간이 μ•„λ‹Œ μžμ—°μ˜ μΌλΆ€λ‘œμ¨ 이 ν™˜κ²½μ—μ„œ μ‚΄μ•„κ°€κ³  μžˆμŠ΅λ‹ˆλ‹€.<|eot_id|>

# --- Example Output [v1-2, Chat Vector] ---
# model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.7, eos_token_id=terminators,)
user<|end_header_id|>

<image>
이 이미지에 λŒ€ν•΄μ„œ μ„€λͺ…ν•΄μ£Όμ„Έμš”.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

이 μ΄λ―Έμ§€λŠ” ν•œ 고양이가 λ¬Ό μœ„λ₯Ό κ±·κ³  μžˆλŠ” λͺ¨μŠ΅μ„ ν¬μ°©ν•œ μ‚¬μ§„μž…λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” 두 발둜 λ¬Ό μœ„λ₯Ό κ±Έμ–΄ κ°€κ³  μžˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” 4개의 발 쀑 2개의 λ°œμ€ 물에 빠지지 μ•Šκ³  2개의 λ°œμ€ 물에 λΉ μ Έ μžˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄μ˜ 발이 빠진 뢀뢄은 λ°˜μ˜λ˜μ–΄ 물에 비쳐 μžˆμŠ΅λ‹ˆλ‹€. λ¬Ό μœ„λ₯Ό κ±·λŠ” κ³ μ–‘μ΄μ˜ λͺ¨μŠ΅μ΄ 참으둜 κ·€μ—½κ³  μ‚¬λž‘μŠ€λŸ½μŠ΅λ‹ˆλ‹€. 이 사진은 KBS λ™λ¬Όμ˜ μ™•κ΅­μ—μ„œ λ°©μ˜λ˜μ—ˆμŠ΅λ‹ˆλ‹€. KBS λ™λ¬Όμ˜ 왕ꡭ은 1985λ…„λΆ€ν„° μ‹œμž‘ν•˜μ—¬ 2019λ…„κΉŒμ§€ 34λ…„ λ™μ•ˆ 방영된 KBS의 λŒ€ν‘œμ μΈ μžμ—° λ‹€νλ©˜ν„°λ¦¬ ν”„λ‘œκ·Έλž¨μž…λ‹ˆλ‹€. KBS λ™λ¬Όμ˜ 왕ꡭ은 λ™λ¬Όμ˜ μƒνƒœμ™€ μŠ΅μ„±, 행동, 그리고 μžμ—° ν™˜κ²½μ„ μ΄ν•΄ν•˜κ³  λ³΄ν˜Έν•˜λŠ” 데 κΈ°μ—¬ν•˜κ³ μž ν•©λ‹ˆλ‹€.

# --- Example Output [v2, Model diff based merging] ---
user<|end_header_id|>

<image>
이 이미지에 λŒ€ν•΄μ„œ μ„€λͺ…ν•΄μ£Όμ„Έμš”.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

이 μ΄λ―Έμ§€μ—λŠ” ν•œκ΅­μ–΄ μžλ§‰κ³Ό ν•¨κ»˜ 고양이가 물에 λ°œμ„ λ””λ””κ³  κ±·λŠ” λͺ¨μŠ΅μ΄ 담겨 μžˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” 였λ₯Έμͺ½ λ°œμ„ 물에 λ‹΄κ·Έκ³  κ±·λŠ” 쀑이며, ν•œκ΅­μ–΄ μžλ§‰μ€ "κ³ μ–‘μ΄λŠ” 물을 μ’‹μ•„ν•©λ‹ˆλ‹€"λΌλŠ” λ¬Έμž₯을 ν¬ν•¨ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 이 μžλ§‰μ€ 고양이가 물을 μ’‹μ•„ν•˜λŠ” 것을 κ°•μ‘°ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€.<|eot_id|>
```