---
language:
- en
- ko
license: llama3
library_name: transformers
base_model:
- meta-llama/Meta-Llama-3-8B
---
<a href="https://github.com/MLP-Lab/Bllossom">
<img src="https://github.com/teddysum/bllossom/blob/main//bllossom_icon.png?raw=true" width="40%" height="50%">
</a>
# Bllossom | [Demo](https://59d690fb9d037d5250.gradio.live/) | [Homepage](https://www.bllossom.ai/) | [Github](https://github.com/MLP-Lab/Bllossom) | [Colab-tutorial](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing) |
Bllossom is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:
* **Knowledge Linking**: Links Korean and English knowledge through additional training.
* **Vocabulary Expansion**: Expands the Korean vocabulary to improve Korean expressiveness.
* **Instruction Tuning**: Tuned on custom-made instruction-following data specialized for the Korean language and Korean culture.
* **Human Feedback**: Direct Preference Optimization (DPO) has been applied.
* **Vision-Language Alignment**: Aligns a vision transformer with this language model.
**This model was developed by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/) and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim)**
## Demo Video
<div style="display: flex; justify-content: space-between;">
<!-- First column -->
<div style="width: 49%;">
<a>
<img src="https://github.com/lhsstn/lhsstn/blob/main/x-llava_dem.gif?raw=true" style="width: 100%; height: auto;">
</a>
<p style="text-align: center;">Bllossom-V Demo</p>
</div>
<!-- Second column (if needed) -->
<div style="width: 49%;">
<a>
<img src="https://github.com/lhsstn/lhsstn/blob/main/bllossom_demo_kakao.gif?raw=true" style="width: 70%; height: auto;">
</a>
<p style="text-align: center;">Bllossom Demo (Kakao)</p>
</div>
</div>
## NEWS
* [2024/04] We released Bllossom v2.0, based on Llama 3.
* [2023/12] We released Bllossom-Vision v1.0, based on Bllossom.
* [2023/08] We released Bllossom v1.0, based on Llama 2.
* [2023/07] We released Bllossom v0.7, based on polyglot-ko.
```text
The MLP Lab at Seoultech has released Bllossom, a Korean-English bilingual language model!
 - Lightweight size, based on Llama3-8B
 - Stronger Korean knowledge through Korean-English knowledge linking
 - Added Korean vocabulary
 - Fine-tuning on custom-made data that reflects Korean culture and language
 - Reinforcement learning (DPO)
 - Extension to a vision-language model
1. Bllossom is a practically oriented language model built in collaboration with linguists from Seoultech, Teddysum, and the Language Resources Lab at Yonsei University. We will keep maintaining it through continuous updates, so please make good use of it 🙂
2. The Bllossom 70B model, the vocabulary-expanded model, and the vision-language model will be released later. (If you are interested, contact us individually; if you can provide GPU support, we will share them for free!)
3. Bllossom has been accepted for presentation at NAACL 2024 and LREC-COLING 2024 (oral).
4. We will keep releasing better language models! Anyone who wants to collaborate on research to strengthen Korean is always welcome!
```
## Example code
### Colab Tutorial
- [Inference-Code-Link](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing)
### Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
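Before loading the model, it can help to confirm which versions of the packages above are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of Bllossom or `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install command above.
REQUIRED = ("torch", "transformers", "accelerate")


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If `transformers` reports something other than 4.40.0, the examples below may still run, but they were written against that pinned version.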
### Python code with Pipeline
```python
import transformers
import torch
model_id = "MLP-KTLim/llama3-Bllossom"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
model_kwargs={"torch_dtype": torch.bfloat16},
device_map="auto",
)
pipeline.model.eval()
PROMPT = '''๋‹น์‹ ์€ ์œ ์šฉํ•œ AI ์–ด์‹œ์Šคํ„ดํŠธ์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž์˜ ์งˆ์˜์— ๋Œ€ํ•ด ์นœ์ ˆํ•˜๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ๋‹ต๋ณ€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.'''
instruction = "์„œ์šธ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™๊ต MLP์—ฐ๊ตฌ์‹ค์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•ด์ค˜"
messages = [
{"role": "system", "content": f"{PROMPT}"},
{"role": "user", "content": f"{instruction}"}
]
prompt = pipeline.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
terminators = [
pipeline.tokenizer.eos_token_id,
pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
prompt,
max_new_tokens=2048,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,
repetition_penalty = 1.1
)
print(outputs[0]["generated_text"][len(prompt):])
# ์„œ์šธ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™๊ต MLP์—ฐ๊ตฌ์‹ค์€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ์—ฐ๊ตฌ๋ฅผ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์„ฑ์›์€ ์ž„๊ฒฝํƒœ ๊ต์ˆ˜์™€ ๊น€๋ฏผ์ค€, ๊น€์ƒ๋ฏผ, ์ตœ์ฐฝ์ˆ˜, ์›์ธํ˜ธ, ์œ ํ•œ๊ฒฐ, ์ž„ํ˜„์„, ์†ก์Šน์šฐ, ์œก์ •ํ›ˆ, ์‹ ๋™์žฌ ํ•™์ƒ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
```
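To see why `<|eot_id|>` is passed as a terminator, it helps to look at the prompt layout that `apply_chat_template` produces. The sketch below hand-rolls the layout published with Meta's Llama 3 Instruct models. It assumes this checkpoint's tokenizer ships the same template, which is not confirmed here; the function `format_llama3_chat` is hypothetical and for illustration only. In real code, keep using `apply_chat_template`.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 Instruct prompt layout (illustrative).

    Assumption: the tokenizer's own chat template matches this layout.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello"},
    ]
    print(format_llama3_chat(demo))
```

Every turn ends with `<|eot_id|>`, so the model is expected to emit that token when its own reply is finished. Listing it alongside `eos_token_id` lets generation stop at the end of the assistant turn instead of running until `max_new_tokens`.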
### Python code with AutoModel
```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'MLP-KTLim/llama3-Bllossom'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.eval()
PROMPT = '''๋‹น์‹ ์€ ์œ ์šฉํ•œ AI ์–ด์‹œ์Šคํ„ดํŠธ์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž์˜ ์งˆ์˜์— ๋Œ€ํ•ด ์นœ์ ˆํ•˜๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ๋‹ต๋ณ€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.'''
instruction = "์„œ์šธ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™๊ต MLP์—ฐ๊ตฌ์‹ค์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•ด์ค˜"
messages = [
{"role": "system", "content": f"{PROMPT}"},
{"role": "user", "content": f"{instruction}"}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
terminators = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = model.generate(
input_ids,
max_new_tokens=2048,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,
repetition_penalty = 1.1
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
# ์„œ์šธ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™๊ต MLP์—ฐ๊ตฌ์‹ค์€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ์—ฐ๊ตฌ๋ฅผ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์„ฑ์›์€ ์ž„๊ฒฝํƒœ ๊ต์ˆ˜์™€ ๊น€๋ฏผ์ค€, ๊น€์ƒ๋ฏผ, ์ตœ์ฐฝ์ˆ˜, ์›์ธํ˜ธ, ์œ ํ•œ๊ฒฐ, ์ž„ํ˜„์„, ์†ก์Šน์šฐ, ์œก์ •ํ›ˆ, ์‹ ๋™์žฌ ํ•™์ƒ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
```
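The single-turn example above can be extended to a multi-turn conversation by replaying earlier turns in the message list. The following is a rough sketch, not part of the Bllossom repository: `build_messages` and `chat` are hypothetical helpers, and `chat` expects the `model` and `tokenizer` objects loaded as in the AutoModel example, reusing the same generation settings.

```python
def build_messages(system_prompt, history, user_input):
    """Assemble a chat-template message list from prior (user, assistant) turns."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_input})
    return messages


def chat(model, tokenizer, messages, max_new_tokens=512):
    """Generate one reply using the sampling settings from the example above."""
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        repetition_penalty=1.1,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

A typical loop would call `reply = chat(model, tokenizer, build_messages(PROMPT, history, user_input))` and then append `(user_input, reply)` to `history` before the next turn. The `max_new_tokens=512` default is an arbitrary choice for interactive use, not a recommendation from the authors.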
## Citation
**Language Model**
```text
@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  url = {https://arxiv.org/pdf/2403.10882},
}
```
**Vision-Language Model**
```text
@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  url = {https://arxiv.org/pdf/2403.11399},
}
```
## Contact
- 임경태(KyungTae Lim), Professor at Seoultech. `ktlim@seoultech.ac.kr`
- 함영균(Younggyun Hahm), CEO of Teddysum. `hahmyg@teddysum.ai`
- 김한샘(Hansaem Kim), Professor at Yonsei. `khss@yonsei.ac.kr`
## Contributors
- 최창수(Changsu Choi), choics2623@seoultech.ac.kr
- 김상민(Sangmin Kim), sangmin9708@naver.com
- 원인호(Inho Won), wih1226@seoultech.ac.kr
- 김민준(Minjun Kim), mjkmain@seoultech.ac.kr
- 송승우(Seungwoo Song), sswoo@seoultech.ac.kr
- 신동재(Dongjae Shin), dylan1998@seoultech.ac.kr
- 임현석(Hyeonseok Lim), gustjrantk@seoultech.ac.kr
- 육정훈(Jeonghun Yuk), usually670@gmail.com
- 유한결(Hangyeol Yoo), 21102372@seoultech.ac.kr
- 송서현(Seohyun Song), alexalex225225@gmail.com