
Korean Formal Converter Using Deep Learning

์กด๋Œ“๋ง๊ณผ ๋ฐ˜๋ง์€ ํ•œ๊ตญ์–ด์—์„œ๋งŒ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค, ๋ณธ ๋ชจ๋ธ์€ ๋ฐ˜๋ง(informal)์„ ์กด๋Œ“๋ง(formal)๋กœ ๋ฐ”๊ฟ”์ฃผ๋Š” ๋ณ€ํ™˜๊ธฐ(convertor) ์ž…๋‹ˆ๋‹ค.
*ํ™•๋ณดํ•œ ์กด๋Œ“๋ง ๋ฐ์ดํ„ฐ์…‹์—๋Š” "ํ•ด์š”์ฒด"์™€ "ํ•ฉ์‡ผ์ฒด" ๋‘ ์ข…๋ฅ˜๊ฐ€ ์กด์žฌํ–ˆ์ง€๋งŒ ๋ณธ ๋ชจ๋ธ์€ "ํ•ด์š”์ฒด"๋กœ ํ†ต์ผํ•˜์—ฌ ๋ณ€ํ™˜ํ•˜๊ธฐ๋กœ ๊ฒฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•ฉ์‡ผ์ฒด *ํ•ด์š”์ฒด
์•ˆ๋…•ํ•˜์‹ญ๋‹ˆ๊นŒ. ์•ˆ๋…•ํ•˜์„ธ์š”.
์ข‹์€ ์•„์นจ์ž…๋‹ˆ๋‹ค. ์ข‹์€ ์•„์นจ์ด์—์š”.
๋ฐ”์˜์‹œ์ง€ ์•Š์•˜์œผ๋ฉด ์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค. ๋ฐ”์˜์‹œ์ง€ ์•Š์•˜์œผ๋ฉด ์ข‹๊ฒ ์–ด์š”.

๋ฐฐ๊ฒฝ

  • ์ด์ „์— ์กด๋Œ“๋ง๊ณผ ๋ฐ˜๋ง์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ถ„๋ฅ˜๊ธฐ(https://github.com/jongmin-oh/korean-formal-classifier) ๋ฅผ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.
    ๋ถ„๋ฅ˜๊ธฐ๋กœ ๋งํˆฌ๋ฅผ ๋‚˜๋ˆ  ์‚ฌ์šฉํ•˜๋ คํ–ˆ์ง€๋งŒ, ์ƒ๋Œ€์ ์œผ๋กœ ์กด๋Œ“๋ง์˜ ๋น„์ค‘์ด ์ ์—ˆ๊ณ  ๋ฐ˜๋ง์„ ์กด๋Œ“๋ง๋กœ ๋ฐ”๊พธ์–ด ์กด๋Œ“๋ง ๋ฐ์ดํ„ฐ์˜ ๋น„์ค‘์„ ๋Š˜๋ฆฌ๊ธฐ์œ„ํ•ด ๋งŒ๋“ค๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํ•œ๊ตญ์–ด ์กด๋Œ“๋ง ๋ณ€ํ™˜๊ธฐ

  • ์กด๋Œ“๋ง ๋ณ€ํ™˜๊ธฐ๋Š” T5๋ชจ๋ธ ์•„ํ‚คํ…์ณ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœํ•œ Text2Text generation Task๋ฅผ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ ๋ฐ˜๋ง์„ ์กด๋Œ“๋ง๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ฐ”๋กœ ์‚ฌ์šฉํ•˜์‹ค ๋ถ„๋“ค์€ ๋ฐ‘์— ์˜ˆ์ œ ์ฝ”๋“œ ์ฐธ๊ณ ํ•ด์„œ huggingFace ๋ชจ๋ธ('j5ng/et5-formal-convertor') ๋‹ค์šด๋ฐ›์•„ ์‚ฌ์šฉํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Base PLM: ET5

Base Datasets

  • AIํ—ˆ๋ธŒ(https://www.aihub.or.kr/) : ํ•œ๊ตญ์–ด ์–ด์ฒด ๋ณ€ํ™˜ ์ฝ”ํผ์Šค

    1. KETI ์ผ์ƒ์˜คํ”ผ์Šค ๋Œ€ํ™” 1,254 ๋ฌธ์žฅ
    2. ์ˆ˜๋™ํƒœ๊น… ๋ณ‘๋ ฌ๋ฐ์ดํ„ฐ
  • ์Šค๋งˆ์ผ๊ฒŒ์ดํŠธ ๋งํˆฌ ๋ฐ์ดํ„ฐ ์…‹(korean SmileStyle Dataset)

Preprocessing

  1. ๋ฐ˜๋ง/์กด๋Œ“๋ง ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ("ํ•ด์š”์ฒด"๋งŒ ๋ถ„๋ฆฌ)

    • ์Šค๋งˆ์ผ๊ฒŒ์ดํŠธ ๋ฐ์ดํ„ฐ์—์„œ (['formal','informal']) ์นผ๋Ÿผ๋งŒ ์‚ฌ์šฉ
    • ์ˆ˜๋™ํƒœ๊น… ๋ณ‘๋ ฌ๋ฐ์ดํ„ฐ์—์„œ [".ban", ".yo"] txt ํŒŒ์ผ๋งŒ ์‚ฌ์šฉ
    • KETI ์ผ์ƒ์˜คํ”ผ์Šค ๋ฐ์ดํ„ฐ์—์„œ(["๋ฐ˜๋ง","ํ•ด์š”์ฒด"]) ์นผ๋Ÿผ๋งŒ ์‚ฌ์šฉ
  2. ๋ฐ์ดํ„ฐ ์…‹ ๋ณ‘ํ•ฉ(3๊ฐ€์ง€ ๋ฐ์ดํ„ฐ ์…‹ ๋ณ‘ํ•ฉ)

  3. ๋งˆ์นจํ‘œ(.)์™€ ์‰ผํ‘œ(,)์ œ๊ฑฐ

  4. ๋ฐ˜๋ง(informal) ์นผ๋Ÿผ ์ค‘๋ณต ์ œ๊ฑฐ : 1632๊ฐœ ์ค‘๋ณต๋ฐ์ดํ„ฐ ์ œ๊ฑฐ

์ตœ์ข… ํ•™์Šต๋ฐ์ดํ„ฐ ์˜ˆ์‹œ

| informal | formal |
| --- | --- |
| 응 고마워 | 네 감사해요 |
| 나도 그 책 읽었어 굉장히 웃긴 책이였어 | 저도 그 책 읽었습니다 굉장히 웃긴 책이였어요 |
| 미세먼지가 많은 날이야 | 미세먼지가 많은 날이네요 |
| 괜찮겠어? | 괜찮으실까요? |
| 아니야 회의가 잠시 뒤에 있어 준비해줘 | 아니에요 회의가 잠시 뒤에 있어요 준비해주세요 |

Total: 14,992 pairs


How to use

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("j5ng/et5-formal-convertor")
tokenizer = T5Tokenizer.from_pretrained("j5ng/et5-formal-convertor")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# device = "mps:0" if torch.backends.mps.is_available() else "cpu"  # for Apple Silicon

model = model.to(device)

# Example input sentence
input_text = "나 진짜 화났어 지금"

# Encode the input with the task prefix
input_encoding = tokenizer("존댓말로 바꿔주세요: " + input_text, return_tensors="pt")

input_ids = input_encoding.input_ids.to(device)
attention_mask = input_encoding.attention_mask.to(device)

# Generate the model output
output_encoding = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)

# Decode the output sentence
output_text = tokenizer.decode(output_encoding[0], skip_special_tokens=True)

# Print the result
print(output_text) # 저 진짜 화났습니다 지금.
```

With Transformers Pipeline

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

model = T5ForConditionalGeneration.from_pretrained('j5ng/et5-formal-convertor')
tokenizer = T5Tokenizer.from_pretrained('j5ng/et5-formal-convertor')

formal_convertor = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    framework="pt",
)

input_text = "널 가질 수 있을거라 생각했어"
output_text = formal_convertor("존댓말로 바꿔주세요: " + input_text,
            max_length=128,
            num_beams=5,
            early_stopping=True)[0]['generated_text']

print(output_text) # 당신을 가질 수 있을거라 생각했습니다.
```
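For converting many sentences at once, the pipeline also accepts a list of inputs. The helper below is a sketch, not part of the released model (`convert_batch` is a hypothetical name); it prepends the same task prefix to each sentence and unwraps the generated text.

```python
def convert_batch(convertor, sentences, prefix="존댓말로 바꿔주세요: "):
    """Run a text2text-generation pipeline over a list of informal sentences.

    `convertor` is any callable with the Transformers pipeline interface:
    given a list of strings, it returns one {'generated_text': ...} dict
    per input.
    """
    outputs = convertor(
        [prefix + s for s in sentences],
        max_length=128,
        num_beams=5,
        early_stopping=True,
    )
    return [o["generated_text"] for o in outputs]
```

Called with the pipeline object created above, e.g. `convert_batch(pipe, ["응 고마워", "괜찮겠어?"])`, batching lets the model pad and process the inputs together instead of one call per sentence.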

Thanks to

์กด๋Œ“๋ง ๋ณ€ํ™˜๊ธฐ์˜ ํ•™์Šต์€ ์ธ๊ณต์ง€๋Šฅ์‚ฐ์—…์œตํ•ฉ์‚ฌ์—…๋‹จ(AICA)์˜ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ์ง€์›๋ฐ›์•„ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
