---
license: apache-2.0
datasets:
- brian-lim/smile_style_orca
language:
- ko
---
# Korean Style Transfer
This model is a fine-tuned version of [Synatra-7B-v0.3-dpo](https://huggingface.co/maywell/Synatra-7B-v0.3-dpo), trained on the Korean style dataset provided by Smilegate AI (https://github.com/smilegate-ai/korean_smile_style_dataset/tree/main).
Since the original dataset is tabular and not suitable for LLM training, I preprocessed it into an instruction-input-output format, which can be found [here](https://huggingface.co/datasets/brian-lim/smile_style_orca).
The dataset was then formatted with the ChatML template for training. Feel free to use my version of the dataset as needed.
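For reference, the tabular-to-instruction conversion described above can be sketched roughly as follows. This is a minimal sketch, not the actual preprocessing code: the field names `instruction`, `input`, and `output`, and the exact placement of the turns in the ChatML template, are assumptions.

```python
# Minimal sketch: wrap one instruction-input-output example in the ChatML
# template. Field names and turn layout are assumptions, not the exact
# preprocessing used for this model.
def to_chatml(example: dict) -> str:
    user_turn = example["instruction"] + "\n" + example["input"]
    return (
        "<|im_start|>user\n" + user_turn + "<|im_end|>\n"
        "<|im_start|>assistant\n" + example["output"] + "<|im_end|>\n"
    )

sample = {
    "instruction": "Rewrite the given text in a royal tone.",
    "input": "[INPUT]: Hello there.",
    "output": "[OUTPUT]: Greetings, subject.",
}
print(to_chatml(sample))
```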
# How to use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('brian-lim/smile-style-transfer')
model = AutoModelForCausalLM.from_pretrained('brian-lim/smile-style-transfer', device_map=device)
# Each prompt asks, in Korean, to rewrite the given text in the target style.
prompts = {
    'informal': '주어진 글을 가능한 형식적이지 않고 딱딱하지 않은 대화체로 바꿔줘.',
    'android': '주어진 글을 가능한 안드로이드 로봇과 같은 대화체로 바꿔줘.',
    'azae': '주어진 글을 가능한 아저씨 같은 말투로 바꿔줘.',
    'chat': '주어진 글을 가능한 인터넷에서 사용하는 말투로 바꿔줘.',
    'choding': '주어진 글을 가능한 초등학생처럼 짧게 줄인 대화체로 바꿔줘.',
    'emoticon': '주어진 글을 가능한 이모티콘이 들어간 대화체로 바꿔줘.',
    'enfp': '주어진 글을 가능한 활기차면서 공감을 많이 하는 친절한 대화체로 바꿔줘.',
    'gentle': "주어진 글을 가능한 '요'로 끝나지 않으면서 깔끔한 대화체로 바꿔줘.",
    'halbae': '주어진 글을 가능한 연륜이 있는 할아버지 같은 말투로 바꿔줘.',
    'halmae': '주어진 글을 가능한 비속어가 들어가는 할머니 같은 말투로 바꿔줘.',
    'joongding': '주어진 글을 가능한 중학교 2학년의 말투로 바꿔줘.',
    'king': '주어진 글을 가능한 조선시대 왕의 말투로 바꿔줘.',
    'seonbi': '주어진 글을 가능한 조선시대 선비의 말투로 바꿔줘.',
    'sosim': '주어진 글을 가능한 아주 소심하고 조심스러운 말투로 바꿔줘.',
    'translator': '주어진 글을 가능한 어색한 한국어 번역 말투로 바꿔줘.',
}
query = '[INPUT]: 안녕하세요. 요즘 날씨가 많이 쌀쌀하네요 \n[OUTPUT]: '
input_query = prompts['king'] + query
input_tokenized = tokenizer(input_query, return_tensors="pt").to(device)
g_config = GenerationConfig(
    temperature=0.3,
    repetition_penalty=1.2,
    max_new_tokens=768,
    do_sample=True,
)
output = model.generate(
    **input_tokenized,
    generation_config=g_config,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
output_text = tokenizer.decode(output.detach().cpu().numpy()[0])
output_text = output_text[output_text.find('[OUTPUT]'):]
print(output_text)
```
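The prompt construction above can be wrapped in a small helper so any style from the dict is applied the same way. This is a hypothetical convenience function, not part of the repository; `prompts` below is a one-entry stand-in for the full dict in the example.

```python
# Hypothetical helper: build the full model query for a given style.
# `prompts` is a one-entry stand-in for the full prompt dict above.
prompts = {'king': '주어진 글을 가능한 조선시대 왕의 말투로 바꿔줘.'}

def build_query(style: str, text: str) -> str:
    """Prepend the style instruction and wrap the text in [INPUT]/[OUTPUT] markers."""
    return prompts[style] + '[INPUT]: ' + text + ' \n[OUTPUT]: '

q = build_query('king', '안녕하세요.')
```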