language:
  - en
  - ko
license: llama3
library_name: transformers
base_model:
  - meta-llama/Meta-Llama-3-8B

Hansung Bllossom | Demo | Developer 김태민 (Taemin Kim) | GitHub |

ν•œμ„±λŒ€ν•™κ΅ QA 기반으둜 ν•™μŠ΅μ‹œν‚¨Hansung-Bllossom-8B λ₯Ό μΆœμ‹œν•©λ‹ˆλ‹€.
μ΄λŠ” MLP-KTLim/llama-3-Korean-Bllossom-8B 을 기반으둜 ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:

  • Knowledge Linking: Links Korean and English knowledge through additional training
  • Vocabulary Expansion: Expands the Korean vocabulary to improve Korean expressiveness
  • Instruction Tuning: Tuned on custom-made instruction-following data specialized for the Korean language and Korean culture
  • Human Feedback: Direct Preference Optimization (DPO) has been applied
  • Vision-Language Alignment: Aligns a vision transformer with this language model

Example code

Install Dependencies

pip install torch transformers==4.40.0 accelerate

Python code with Pipeline

import transformers
import torch

model_id = "kfkas/Hansung-Bllossom-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline.model.eval()

# System prompt in Korean, followed by its English equivalent.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# User question: "What festivals or events are held at Hansung University?"
instruction = "한성대학교에서는 어떤 축제나 행사가 열리나요?"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Render the chat as one prompt string, ready for the assistant's reply.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

print(outputs[0]["generated_text"][len(prompt):])
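
For reference, `apply_chat_template` turns the `messages` list into a single prompt string. The sketch below hand-renders the standard Llama 3 Instruct layout, on the assumption that this model's tokenizer ships that template; this is not confirmed here, and the tokenizer's own template remains authoritative. `render_llama3_prompt` is a hypothetical helper written for illustration, not part of transformers.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-render chat messages in the Llama 3 Instruct layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the trimmed content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that its reply comes next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Assumed example messages; any system/user pair works the same way.
example_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "When was Hansung University founded?"},
]
print(render_llama3_prompt(example_messages))
```

In this layout every turn ends with `<|eot_id|>`, which is why the example above passes that token to `eos_token_id` alongside the tokenizer's regular EOS token.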

Python code with AutoModel


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'kfkas/Hansung-Bllossom-8B'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model.eval()

# System prompt in Korean, followed by its English equivalent.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# User question: "When was Hansung University founded?"
instruction = "한성대학교는 언제 설립되었나요?"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
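
`model.generate` returns the prompt tokens followed by the newly generated ones, which is why the example slices at `input_ids.shape[-1]` before decoding. The sketch below mimics that step on plain Python lists. `trim_reply` and all token IDs are made up for illustration; in the real example, `skip_special_tokens=True` is what drops the trailing terminator.

```python
def trim_reply(output_ids, prompt_len, terminators):
    """Drop the prompt tokens, then cut the reply at the first terminator ID."""
    reply = output_ids[prompt_len:]
    for i, token_id in enumerate(reply):
        if token_id in terminators:
            return reply[:i]
    return reply


prompt_ids = [1, 10, 11, 12]             # toy IDs standing in for the templated prompt
output_ids = prompt_ids + [20, 21, 99]   # the prompt is echoed, then new tokens follow
print(trim_reply(output_ids, len(prompt_ids), terminators={98, 99}))  # prints [20, 21]
```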

Citation

Language Model

@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
}

Vision-Language Model

@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
}

Contact

  • 김태민 (Taemin Kim), Intelligent System. taemin6697@gmail.com

Contributor

  • 김태민 (Taemin Kim), Intelligent System. taemin6697@gmail.com