---
language:
- en
- ko
license: llama3
library_name: transformers
base_model:
- meta-llama/Meta-Llama-3-8B
---
<a href="https://github.com/MLP-Lab/Bllossom">
<img src="https://github.com/teddysum/bllossom/blob/main//bllossom_icon.png?raw=true" width="40%" height="50%">
</a>
# Bllossom | [Demo]() | [Homepage](https://www.bllossom.ai/) | [Github](https://github.com/MLP-Lab/Bllossom) |
[Colab code example for GPU](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing) |
[Colab quantized-model code example for CPU](https://colab.research.google.com/drive/129ZNVg5R2NPghUEFHKF0BRdxsZxinQcJ?usp=drive_link)
```bash
저희 BllossomνŒ€ μ—μ„œ ν•œκ΅­μ–΄-μ˜μ–΄ 이쀑 μ–Έμ–΄λͺ¨λΈμΈ Bllossom을 κ³΅κ°œν–ˆμŠ΅λ‹ˆλ‹€!
μ„œμšΈκ³ΌκΈ°λŒ€ μŠˆνΌμ»΄ν“¨νŒ… μ„Όν„°μ˜ μ§€μ›μœΌλ‘œ 100GBκ°€λ„˜λŠ” ν•œκ΅­μ–΄λ‘œ λͺ¨λΈμ „체λ₯Ό ν’€νŠœλ‹ν•œ ν•œκ΅­μ–΄ κ°•ν™” 이쀑언어 λͺ¨λΈμž…λ‹ˆλ‹€!
ν•œκ΅­μ–΄ μž˜ν•˜λŠ” λͺ¨λΈ μ°Ύκ³  μžˆμ§€ μ•ŠμœΌμ…¨λ‚˜μš”?
- ν•œκ΅­μ–΄ 졜초! 무렀 3λ§Œκ°œκ°€ λ„˜λŠ” ν•œκ΅­μ–΄ μ–΄νœ˜ν™•μž₯
- Llama3λŒ€λΉ„ λŒ€λž΅ 25% 더 κΈ΄ 길이의 ν•œκ΅­μ–΄ Context μ²˜λ¦¬κ°€λŠ₯
- ν•œκ΅­μ–΄-μ˜μ–΄ Pararell Corpusλ₯Ό ν™œμš©ν•œ ν•œκ΅­μ–΄-μ˜μ–΄ 지식연결 (μ‚¬μ „ν•™μŠ΅)
- ν•œκ΅­μ–΄ λ¬Έν™”, μ–Έμ–΄λ₯Ό κ³ λ €ν•΄ μ–Έμ–΄ν•™μžκ°€ μ œμž‘ν•œ 데이터λ₯Ό ν™œμš©ν•œ λ―Έμ„Έμ‘°μ •
- κ°•ν™”ν•™μŠ΅
이 λͺ¨λ“ κ²Œ ν•œκΊΌλ²ˆμ— 적용되고 상업적 이용이 κ°€λŠ₯ν•œ Bllossom을 μ΄μš©ν•΄ μ—¬λŸ¬λΆ„ 만의 λͺ¨λΈμ„ λ§Œλ“€μ–΄λ³΄μ„Έμš₯!
무렀 Colab 무료 GPU둜 ν•™μŠ΅μ΄ κ°€λŠ₯ν•©λ‹ˆλ‹€. ν˜Ήμ€ μ–‘μžν™” λͺ¨λΈλ‘œ CPUμ—μ˜¬λ €λ³΄μ„Έμš” [μ–‘μžν™”λͺ¨λΈ](https://huggingface.co/MLP-KTLim/llama-3-Korean-Bllossom-8B-4bit)
1. Bllossom-8BλŠ” μ„œμšΈκ³ΌκΈ°λŒ€, ν…Œλ””μΈ, μ—°μ„ΈλŒ€ μ–Έμ–΄μžμ› μ—°κ΅¬μ‹€μ˜ μ–Έμ–΄ν•™μžμ™€ ν˜‘μ—…ν•΄ λ§Œλ“  μ‹€μš©μ£Όμ˜κΈ°λ°˜ μ–Έμ–΄λͺ¨λΈμž…λ‹ˆλ‹€! μ•žμœΌλ‘œ 지속적인 μ—…λ°μ΄νŠΈλ₯Ό 톡해 κ΄€λ¦¬ν•˜κ² μŠ΅λ‹ˆλ‹€ 많이 ν™œμš©ν•΄μ£Όμ„Έμš” πŸ™‚
2. 초 κ°•λ ₯ν•œ Advanced-Bllossom 8B, 70Bλͺ¨λΈ, μ‹œκ°-μ–Έμ–΄λͺ¨λΈμ„ λ³΄μœ ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€! (κΆκΈˆν•˜μ‹ λΆ„μ€ κ°œλ³„ μ—°λ½μ£Όμ„Έμš”!!)
3. Bllossom은 NAACL2024, LREC-COLING2024 (ꡬ두) λ°œν‘œλ‘œ μ±„νƒλ˜μ—ˆμŠ΅λ‹ˆλ‹€.
4. 쒋은 μ–Έμ–΄λͺ¨λΈ 계속 μ—…λ°μ΄νŠΈ ν•˜κ² μŠ΅λ‹ˆλ‹€!! ν•œκ΅­μ–΄ κ°•ν™”λ₯Όμœ„ν•΄ 곡동 μ—°κ΅¬ν•˜μ‹€λΆ„(νŠΉνžˆλ…Όλ¬Έ) μ–Έμ œλ“  ν™˜μ˜ν•©λ‹ˆλ‹€!!
특히 μ†ŒλŸ‰μ˜ GPU라도 λŒ€μ—¬ κ°€λŠ₯ν•œνŒ€μ€ μ–Έμ œλ“  μ—°λ½μ£Όμ„Έμš”! λ§Œλ“€κ³  싢은거 λ„μ™€λ“œλ €μš”.
```
Bllossom is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:
* **Knowledge Linking**: Linking Korean and English knowledge through additional training
* **Vocabulary Expansion**: Expansion of Korean vocabulary to enhance Korean expressiveness.
* **Instruction Tuning**: Tuning using custom-made instruction following data specialized for Korean language and Korean culture
* **Human Feedback**: DPO has been applied
* **Vision-Language Alignment**: Aligning the vision transformer with this language model
**This model was developed by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/) and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim)**
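The vocabulary expansion listed above is also what the card's "roughly 25% longer Korean context" figure rests on: when the same Korean text needs fewer tokens, a fixed token window holds more text. The sketch below works through that arithmetic in plain Python. The token counts are hypothetical placeholders, not measurements from either tokenizer, and `context_gain` is an illustrative helper, not part of any library.

```python
def context_gain(base_tokens: int, expanded_tokens: int) -> float:
    """Relative increase in text that fits in a fixed token window.

    base_tokens: tokens the original tokenizer needs for a sample text.
    expanded_tokens: tokens the expanded-vocabulary tokenizer needs
        for the same text.
    """
    if base_tokens <= 0 or expanded_tokens <= 0:
        raise ValueError("token counts must be positive")
    return base_tokens / expanded_tokens - 1.0


# Hypothetical counts for one Korean passage (assumed for illustration only).
base_tokens = 1000
expanded_tokens = 800

gain = context_gain(base_tokens, expanded_tokens)
print(f"{gain:.0%} more text fits in the same window")
# prints: 25% more text fits in the same window
```

To get real numbers, tokenize the same Korean text with both tokenizers and pass the two `len(tokenizer(text)["input_ids"])` values to the helper.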
## Demo Video
<div style="display: flex; justify-content: space-between;">
<!-- First column -->
<div style="width: 49%;">
<a>
<img src="https://github.com/lhsstn/lhsstn/blob/main/x-llava_dem.gif?raw=true" style="width: 100%; height: auto;">
</a>
<p style="text-align: center;">Bllossom-V Demo</p>
</div>
<!-- Second column (if needed) -->
<div style="width: 49%;">
<a>
<img src="https://github.com/lhsstn/lhsstn/blob/main/bllossom_demo_kakao.gif?raw=true" style="width: 70%; height: auto;">
</a>
<p style="text-align: center;">Bllossom Demo (Kakao)</p>
</div>
</div>
## NEWS
* [2024.05.08] Vocab Expansion Model Update
* [2024.04.25] We released Bllossom v2.0, based on llama-3
* [2023/12] We released Bllossom-Vision v1.0, based on Bllossom
* [2023/08] We released Bllossom v1.0, based on llama-2.
* [2023/07] We released Bllossom v0.7, based on polyglot-ko.
## Example code
### Colab Tutorial
- [Inference-Code-Link](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing)
### Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
### Python code with Pipeline
```python
import torch
import transformers

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"

# Build a text-generation pipeline in bfloat16 and let accelerate place the weights.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline.model.eval()

# Bilingual system prompt: the Korean sentence says the same thing as the English one.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# Korean for: "Tell me about the MLP lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Render the chat as a single prompt string that ends with an open assistant turn.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop on either the tokenizer's EOS token or the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns prompt + completion, so slice the prompt off.
print(outputs[0]["generated_text"][len(prompt):])
# Example output (translated from Korean): The MLP lab at Seoul National University of Science and Technology
# conducts multimodal natural language processing research. Its members are Professor KyungTae Lim and students
# Minjun Kim, Sangmin Kim, Changsu Choi, Inho Won, Hangyeol Yoo, Hyeonseok Lim, Seungwoo Song, Jeonghun Yuk, and Dongjae Shin.
```
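To make the `apply_chat_template` step and the `<|eot_id|>` terminator less opaque, here is a minimal sketch of the prompt string they work with. It assumes the tokenizer ships the standard Llama 3 Instruct chat layout; `render_llama3_prompt` is a hypothetical helper written for illustration, and the authoritative template is whatever `tokenizer.chat_template` contains.

```python
def render_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Render chat messages in the Llama 3 Instruct layout (assumed template)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the trimmed content,
        # and an end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Counterpart of add_generation_prompt=True: open the assistant turn
    # so that the model writes the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello"},
]
print(render_llama3_prompt(example_messages))
```

Because every turn ends with `<|eot_id|>`, the model tends to emit that token when it finishes its reply, which is why the example above passes it as a terminator alongside the tokenizer's EOS token.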
### Python code with AutoModel
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'MLP-KTLim/llama-3-Korean-Bllossom-8B'

# Load the tokenizer and the model in bfloat16, letting accelerate place the weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Bilingual system prompt: the Korean sentence says the same thing as the English one.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# Korean for: "Tell me about the MLP lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Tokenize the chat directly into input ids that end with an open assistant turn.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop on either the tokenizer's EOS token or the end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
# Example output (translated from Korean): The MLP lab at Seoul National University of Science and Technology
# conducts multimodal natural language processing research. Its members are Professor KyungTae Lim and students
# Minjun Kim, Sangmin Kim, Changsu Choi, Inho Won, Hangyeol Yoo, Hyeonseok Lim, Seungwoo Song, Jeonghun Yuk, and Dongjae Shin.
```
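The last line of the example slices the prompt off the output and relies on the terminators to end the reply. The sketch below illustrates that post-processing on plain Python lists. It is a conceptual illustration with made-up token ids, not how `generate` is implemented, and `extract_completion` is a hypothetical helper.

```python
def extract_completion(
    output_ids: list[int], prompt_len: int, terminators: list[int]
) -> list[int]:
    """Drop the prompt tokens, then cut at the first terminator token."""
    completion = output_ids[prompt_len:]
    for index, token_id in enumerate(completion):
        if token_id in terminators:
            return completion[:index]
    # No terminator found (for example, max_new_tokens was reached first).
    return completion


# Made-up token ids for illustration only; 9 stands in for a terminator id.
prompt_ids = [1, 5, 6]
output_ids = prompt_ids + [42, 43, 9]
print(extract_completion(output_ids, len(prompt_ids), terminators=[9]))
# prints: [42, 43]
```

In the real example, `outputs[0][input_ids.shape[-1]:]` plays the role of the slice, and `skip_special_tokens=True` removes the terminator during decoding.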
## Citation
**Language Model**
```text
@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
}
```
**Vision-Language Model**
```text
@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
}
```
## Contact
- 임경태(KyungTae Lim), Professor at Seoultech. `ktlim@seoultech.ac.kr`
- 함영균(Younggyun Hahm), CEO of Teddysum. `hahmyg@teddysum.ai`
- 김한샘(Hansaem Kim), Professor at Yonsei. `khss@yonsei.ac.kr`
## Contributor
- 최창수(Chansu Choi), choics2623@seoultech.ac.kr
- 김상민(Sangmin Kim), sangmin9708@naver.com
- 원인호(Inho Won), wih1226@seoultech.ac.kr
- 김민준(Minjun Kim), mjkmain@seoultech.ac.kr
- 송승우(Seungwoo Song), sswoo@seoultech.ac.kr
- 신동재(Dongjae Shin), dylan1998@seoultech.ac.kr
- 임현석(Hyeonseok Lim), gustjrantk@seoultech.ac.kr
- 육정훈(Jeonghun Yuk), usually670@gmail.com
- 유한결(Hangyeol Yoo), 21102372@seoultech.ac.kr
- 송서현(Seohyun Song), alexalex225225@gmail.com