Bad_text_classifier

Model Introduction

This model classifies whether comments and chat messages found across the internet contain sensitive content. It was fine-tuned on a dataset built from public data by revising the labels and merging the sources. Please understand that the model cannot judge every sentence correctly every time.

NOTE)
Because of copyright restrictions on the public data, the modified data used to train the model cannot be released.
The model's outputs also do not reflect my own opinions.

Dataset

data label

  • 0 : bad sentence
  • 1 : not bad sentence

Datasets used

Dataset processing

Neither source dataset was originally labeled for binary classification, so both were relabeled into binary form. Then only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were selected and merged into the processed Korean Unsmile Dataset.
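The relabel-and-merge step described above can be sketched roughly as follows. This is only an illustration: the field names `text`, `clean`, and `hate`, and the value `"none"`, are assumptions about how the source datasets might be laid out, not something this card specifies, and the helper names are hypothetical.

```python
BAD, NOT_BAD = 0, 1  # label convention from the "data label" list


def unsmile_to_binary(row: dict) -> int:
    # Assumed schema: a "clean" flag that is 1 when no hate category applies.
    return NOT_BAD if row["clean"] == 1 else BAD


def hatespeech_to_binary(row: dict) -> int:
    # Assumed schema: a "hate" field whose value "none" marks a harmless comment.
    return NOT_BAD if row["hate"] == "none" else BAD


def build_training_set(unsmile_rows, hatespeech_rows):
    """Relabel both sources to binary, then add only the NOT_BAD
    HateSpeech rows to the full relabeled Unsmile data."""
    merged = [
        {"text": r["text"], "label": unsmile_to_binary(r)} for r in unsmile_rows
    ]
    merged += [
        {"text": r["text"], "label": NOT_BAD}
        for r in hatespeech_rows
        if hatespeech_to_binary(r) == NOT_BAD
    ]
    return merged
```

Every Unsmile row is kept with its new binary label, while HateSpeech rows that map to 0 are dropped, which matches the merge described in the text.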

Some examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence):

  • Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence)
  • Examples containing terms with sexual connotations, such as "좆" and "봊", were changed to 0 (bad sentence)
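The relabeling rules above could be expressed as a small function like the one below. It is a minimal sketch using plain substring matching. The card does not say how the matching was actually done, so the exact logic and the function name are assumptions. The term list contains only the two terms the card names, although the card says "such as".

```python
BAD, NOT_BAD = 0, 1

NO_MARKER = "노"                  # the "~노" pattern, matched here as a substring (assumption)
NO_COOCCURRING = ("이기", "노무")  # terms that flag a "~노" sentence
SEXUAL_TERMS = ("좆", "봊")        # only the examples named in the card


def relabel_clean_example(text: str, label: int) -> int:
    """Return the corrected label for an example originally labeled clean."""
    if label != NOT_BAD:
        return label  # the rules only touch examples labeled clean
    if NO_MARKER in text and any(term in text for term in NO_COOCCURRING):
        return BAD
    if any(term in text for term in SEXUAL_TERMS):
        return BAD
    return label
```

Substring matching is deliberately crude and will over-match ordinary words. A manual review pass, which the card's wording ("some examples") may imply, would be more precise.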

Model Training

  • Fine-tuning was performed with ElectraForSequenceClassification from Hugging Face transformers.
  • Three publicly available Korean Electra models were each fine-tuned separately.

Models used

How to use model?

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
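The snippet above only loads the model and tokenizer. A possible way to get predictions from it is sketched below. It assumes PyTorch is installed and that the checkpoint's output indices follow the label convention from the Dataset section (0 = bad sentence, 1 = not bad sentence); check the checkpoint's config before relying on that. The helper names are hypothetical.

```python
import math

MODEL_ID = "JminJ/kcElectra_base_Bad_Sentence_Classifier"
# Assumed to match the "data label" list in this card.
ID2LABEL = {0: "bad sentence", 1: "not bad sentence"}


def logits_to_prediction(logits):
    """Apply a numerically stable softmax to one row of logits and
    return (label name, probability of that label)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


def classify(texts):
    """Download the checkpoint and classify a list of strings."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [logits_to_prediction(row) for row in logits.tolist()]
```

Calling `classify(["..."])` returns one `(label, probability)` pair per input sentence.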

Model Valid Accuracy

model                                       accuracy
kcElectra_base_fp16_wd_custom_dataset       0.8849
tunibElectra_base_fp16_wd_custom_dataset    0.8726
koElectra_base_fp16_wd_custom_dataset       0.8434
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
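For anyone trying to reproduce a similar run, the hyperparameters in the note could be collected as below. The key names follow the parameter names of `transformers.TrainingArguments`, but the card does not say whether the `Trainer` API was used at all, so this is only one possible arrangement. Three points are assumptions: `fp16` is inferred from "fp16" in the checkpoint names, the batch size of 128 is treated as per-device, and the seed is left as a parameter because its value is not stated.

```python
def training_hyperparameters(seed: int) -> dict:
    """Hyperparameters from the note, keyed by TrainingArguments parameter names."""
    return {
        "learning_rate": 3e-6,               # learning_rate(3e-06)
        "weight_decay": 0.001,               # weight_decay lambda(0.001)
        "per_device_train_batch_size": 128,  # batch_size(128); per-device is an assumption
        "fp16": True,                        # inferred from the checkpoint names
        "seed": seed,                        # same seed for all models; value not given
    }
```

These could be passed on as `TrainingArguments(output_dir=..., **training_hyperparameters(seed))`. Note that `fp16=True` needs a GPU that supports mixed precision.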

Contact

Github
