# Bad_text_classifier
## Model Introduction
This model classifies whether comments and chat messages found across the internet contain sensitive content. It was fine-tuned on a dataset built by relabeling public datasets and merging them. Please keep in mind that the model cannot always classify every sentence correctly.
```
NOTE)
Due to copyright issues with the public data, the modified data used to train the model cannot be released.
The model's opinions are also unrelated to my own.
```
## Dataset
### data label
* **0 : bad sentence**
* **1 : not bad sentence**
### Datasets used
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
### Dataset processing method
Neither dataset was originally labeled for binary classification, so both were relabeled into binary form. Only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were then selected and merged into the processed Korean Unsmile Dataset.
</br>

**Some examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).**
* Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence)
* Examples containing terms with sexual connotations, such as "좆" and "봊", were changed to 0 (bad sentence)
</br>
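
The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: it assumes both datasets have already been reduced to `(text, label)` pairs, the function names are hypothetical, and the "~노" ending is approximated with a plain substring check. The README does not say whether the rules were applied automatically or by hand, and the term list is only the two examples quoted above.

```python
from typing import List, Tuple

BAD, NOT_BAD = 0, 1

# Substrings quoted in the rules above. The original list ends with "etc.",
# so this is not the complete set the author used.
NO_ENDING_MARKERS = ("이기", "노무")
SEXUAL_TERMS = ("좆", "봊")

Example = Tuple[str, int]


def relabel_clean(text: str, label: int) -> int:
    """Flip a clean (1) example to bad (0) when one of the two rules matches."""
    if label != NOT_BAD:
        return label
    # Rule 1 (simplified): the sentence contains "노" plus "이기" or "노무".
    if "노" in text and any(marker in text for marker in NO_ENDING_MARKERS):
        return BAD
    # Rule 2: the sentence contains a term with sexual connotations.
    if any(term in text for term in SEXUAL_TERMS):
        return BAD
    return label


def build_training_set(unsmile: List[Example], hatespeech: List[Example]) -> List[Example]:
    """Relabel the Unsmile data, then append only the label-1 HateSpeech examples."""
    relabeled = [(text, relabel_clean(text, label)) for text, label in unsmile]
    clean_only = [(text, label) for text, label in hatespeech if label == NOT_BAD]
    return relabeled + clean_only
```

Keeping only the label 1 examples from the second dataset means that every bad-sentence example in the merged set comes from the Korean Unsmile Dataset.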
## Model Training
* Fine-tuning was performed with `ElectraForSequenceClassification` from huggingface transformers.
* Three publicly available Korean ELECTRA models were each fine-tuned separately.
### Models used
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/KoELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
## How to use the model
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model_name = 'JminJ/tunibElectra_base_Bad_Sentence_Classifier'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
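
Once the model and tokenizer are loaded, a prediction could be obtained along the lines below. This is a sketch under assumptions: the helper names `logits_to_labels` and `classify` are hypothetical, and the mapping from output index to label is taken from the "data label" section (0 is bad, 1 is not bad), which is assumed to match the order of the model's output logits.

```python
from typing import List, Sequence

MODEL_NAME = "JminJ/tunibElectra_base_Bad_Sentence_Classifier"

# Assumed to follow the "data label" section of this README.
ID2LABEL = {0: "bad sentence", 1: "not bad sentence"}


def logits_to_labels(logits: Sequence[Sequence[float]]) -> List[str]:
    """Pick the highest-scoring class in each row and map it to its label."""
    return [
        ID2LABEL[max(range(len(row)), key=row.__getitem__)]
        for row in logits
    ]


def classify(texts: List[str]) -> List[str]:
    """Tokenize the texts, run the classifier, and return one label per text."""
    # Imported here so the helper above stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits_to_labels(logits.tolist())
```

Calling `classify(["..."])` downloads the checkpoint on first use, so it needs network access along with `torch` and `transformers` installed.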
## Model Validation Accuracy
| model | accuracy |
| ---------- | ---------- |
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |
```
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
```
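
For anyone reproducing the setup, the hyperparameters in the note could be collected as shown below. This is only a sketch: the README does not say whether the Hugging Face `Trainer` was used, so the keys simply follow the argument names of `transformers.TrainingArguments`. Several things here are assumptions: the `fp16` flag is inferred from the "fp16" in the model names, the batch size is treated as a single-device value, `training_kwargs` and `output_dir` are hypothetical, and the seed is left to the caller because its value is not stated.

```python
from typing import Any, Dict


def training_kwargs(seed: int, output_dir: str = "outputs") -> Dict[str, Any]:
    """Collect the hyperparameters from the note as keyword arguments."""
    return {
        "output_dir": output_dir,            # hypothetical path
        "seed": seed,                        # value not given in this README
        "learning_rate": 3e-06,
        "weight_decay": 0.001,
        "per_device_train_batch_size": 128,  # assumes one device
        "fp16": True,                        # inferred from the model names
    }
```

These could be passed as `TrainingArguments(**training_kwargs(seed=...))`. Note that mixed-precision training with `fp16=True` requires a CUDA-capable GPU.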
## Contact
* jminju254@gmail.com
</br></br>
## Github
* https://github.com/JminJ/Bad_text_classifier
</br></br>
## Reference
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/KoELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)