ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ๊ฐ์ • ๋ถ„๋ฅ˜ ๋ชจ๋ธ (KoBERT ๊ธฐ๋ฐ˜)

Model Description

์ด ๋ชจ๋ธ์€ AIHub์˜ ๊ฐ์„ฑ๋Œ€ํ™” ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๊ตญ์–ด ๋Œ€ํ™”์˜ ๊ฐ์ •์„ ๋ถ„๋ฅ˜ํ•˜๋Š” KoBERT ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์›๋ณธ ๋ฐ์ดํ„ฐ์…‹์„ ๋‹ค์‹œ 5๊ฐœ์˜ ๊ฐ์ • ๋ฒ”์ฃผ๋กœ ๋ ˆ์ด๋ธ”๋งํ•˜์˜€์œผ๋ฉฐ, Hugging Face์˜ Trainer ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

5๊ฐœ์˜ ๊ฐ์ • ๋ฒ”์ฃผ:

  • 0: Angry
  • 1: Fear
  • 2: Happy
  • 3: Tender
  • 4: Sad

Training Data

์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์…‹์€ AIHub์˜ ๊ฐ์„ฑ๋Œ€ํ™” ๋ง๋ญ‰์น˜์—์„œ ๊ฐ€์ ธ์˜จ ๋ฐ์ดํ„ฐ๋กœ, ๋Œ€ํ™” ํ…์ŠคํŠธ๋ฅผ 5๊ฐœ์˜ ๋ฒ”์ฃผ๋กœ ๋ ˆ์ด๋ธ”๋งํ•˜์—ฌ ์ „์ฒ˜๋ฆฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ๋Š” 80%๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ, ๋‚˜๋จธ์ง€ 20%๋Š” ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋กœ ๋‚˜๋ˆ„์–ด ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Pre-trained Model

์ด ๋ชจ๋ธ์€ monologg/kobert ์‚ฌ์ „ ํ•™์Šต๋œ KoBERT ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. KoBERT๋Š” ํ•œ๊ตญ์–ด BERT ๋ชจ๋ธ๋กœ์„œ, ์ด ํ”„๋กœ์ ํŠธ์—์„œ 5๊ฐœ์˜ ๊ฐ์ • ๋ฒ”์ฃผ๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ชฉ์ ์„ ์œ„ํ•ด ๋ฏธ์„ธ ์กฐ์ •(fine-tuning)๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Tokenizer

๋ชจ๋ธ์€ AutoTokenizer๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ์„ ํ† ํฐํ™”ํ•˜์˜€์œผ๋ฉฐ, padding='max_length'์™€ truncation=True ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ๋Œ€ ๊ธธ์ด 128์˜ ์ž…๋ ฅ์œผ๋กœ ๋ณ€ํ™˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

from transformers import AutoTokenizer

# KoBERT uses a custom tokenizer implementation, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)

def tokenize_function(examples):
    # Pad/truncate every example to a fixed length of 128 tokens
    return tokenizer(examples['input_text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

Model Architecture

๋ชจ๋ธ์€ BertForSequenceClassification ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 5๊ฐœ์˜ ๊ฐ์ • ๋ ˆ์ด๋ธ”์„ ๋ถ„๋ฅ˜ํ•ฉ๋‹ˆ๋‹ค.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('monologg/kobert', num_labels=5)
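
As a hedged variation (not part of the original training script), the label mapping from the Model Description section can also be attached to the model config so that saved checkpoints carry readable label names:

# id2label/label2id are standard transformers config fields
model = BertForSequenceClassification.from_pretrained(
    'monologg/kobert',
    num_labels=5,
    id2label={0: "Angry", 1: "Fear", 2: "Happy", 3: "Tender", 4: "Sad"},
    label2id={"Angry": 0, "Fear": 1, "Happy": 2, "Tender": 3, "Sad": 4},
)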

Training Configuration

๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•ด Hugging Face์˜ Trainer ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ•™์Šต ์„ค์ •์„ ์ ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค:

  • Learning rate: 2e-5
  • Batch size: 16
  • Number of epochs: 10
  • Evaluation strategy: evaluate at the end of every epoch
  • Best checkpoint selected by macro F1 score (a compute_metrics sketch follows the code below)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True
)
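
The metric function and the Trainer construction are not shown in the card, but metric_for_best_model="f1_macro" requires a compute_metrics function that returns an "f1_macro" key. Below is a minimal sketch consistent with the settings above (the use of scikit-learn and the extra accuracy key are assumptions):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),  # key referenced by metric_for_best_model
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()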

How to Use the Model

๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด Hugging Face transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ KoBERT ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import AutoTokenizer, BertForSequenceClassification
import torch

# Use KoBERT's original tokenizer
tokenizer = AutoTokenizer.from_pretrained('monologg/kobert', trust_remote_code=True)
model = BertForSequenceClassification.from_pretrained('jeonghyeon97/koBERT-Senti5')
model.eval()

# Example inputs (a list of sentences)
texts = [
    "์˜ค๋Š˜์€ ์ •๋ง ํ–‰๋ณตํ•œ ํ•˜๋ฃจ์•ผ!",
    "์ด๊ฑฐ ์ •๋ง ์งœ์ฆ๋‚˜๊ณ  ํ™”๋‚œ๋‹ค.",
    "๊ทธ๋ƒฅ ๊ทธ๋ ‡๋„ค.",
    "์™œ ์ด๋ ‡๊ฒŒ ์Šฌํ”„์ง€?",
    "๊ธฐ๋ถ„์ด ์ข€ ๋ถˆ์•ˆํ•ด."
]

# Tokenize the input texts
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Predict (no gradient tracking needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Print the results
for text, prediction in zip(texts, predictions):
    print(f"Input: {text} -> predicted emotion label: {prediction.item()}")
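
Continuing from the snippet above, the numeric ids can be mapped back to the emotion names listed in the Model Description, and softmax scores can be reported. The ID2LABEL dictionary is written out here for illustration; it is not stored in the uploaded model config:

import torch

# Label-id to emotion-name mapping (from the Model Description section)
ID2LABEL = {0: "Angry", 1: "Fear", 2: "Happy", 3: "Tender", 4: "Sad"}

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

for text, prob in zip(texts, probs):
    label_id = int(prob.argmax())
    print(f"{text} -> {ID2LABEL[label_id]} ({prob[label_id].item():.2f})")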

Model size: 92.2M parameters (F32, Safetensors)
ยท
Inference Providers NEW
This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Model tree for jeonghyeon97/koBERT-Senti5

Base model

monologg/kobert
Finetuned
(7)
this model