JminJ committed on
Commit
fb1e399
•
1 Parent(s): 9c51c95

Create README.md

  1. README.md +110 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ # Bad_text_classifier
+
+ ## Model introduction
+ This model classifies whether comments and chat messages found across the internet contain sensitive content. It was fine-tuned on a dataset built by relabeling public datasets and merging them. Please note that the model cannot always judge every sentence correctly.
+ ```
+ NOTE)
+ Because of copyright restrictions on the public datasets, the modified data used to train the model cannot be released.
+ The model's outputs also do not reflect my own opinions.
+ ```
+
+ ## Dataset
+ ### data label
+ * **0 : bad sentence**
+ * **1 : not bad sentence**
+ ### datasets used
+ * [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
+ * [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
+ ### dataset processing method
+ Neither dataset was originally labeled for binary classification, so both were relabeled into binary form. Only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were then selected and merged into the processed Korean Unsmile Dataset.
+ </br>
+
+ **Some examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).**
+ * Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence)
+ * Examples with sexual connotations, such as those containing "좆" or "봊", were changed to 0 (bad sentence)
+ </br></br>
+
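The relabeling and merge steps described in the dataset section can be sketched roughly as follows. This is only an illustration, not the author's preprocessing code: the function names `relabel_clean` and `merge` are hypothetical, plain substring matching is an assumption, and both inputs are assumed to be lists of `(text, label)` pairs that have already been converted to binary labels.

```python
BAD, NOT_BAD = 0, 1  # label ids from the "data label" list

# Keywords quoted in the README; substring matching is an assumption.
DIALECT_MARKER = "노"
DIALECT_KEYWORDS = ("이기", "노무")
SEXUAL_KEYWORDS = ("좆", "봊")


def relabel_clean(text, label):
    """Return the adjusted label for one Korean Unsmile example."""
    if label != NOT_BAD:
        return label  # only examples labeled clean are revisited
    if DIALECT_MARKER in text and any(k in text for k in DIALECT_KEYWORDS):
        return BAD
    if any(k in text for k in SEXUAL_KEYWORDS):
        return BAD
    return NOT_BAD


def merge(unsmile, hate_speech):
    """Relabel Unsmile rows, then append only the not-bad HateSpeech rows."""
    fixed = [(text, relabel_clean(text, label)) for text, label in unsmile]
    extra = [(text, label) for text, label in hate_speech if label == NOT_BAD]
    return fixed + extra
```

Keeping the keyword lists as module-level tuples makes it easy to audit which rules changed a label.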
+ ## Model Training
+ * Fine-tuning was performed with ElectraForSequenceClassification from huggingface transformers.
+ * Three publicly available Korean ELECTRA models were each fine-tuned separately.
+ ### use model
+ * [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
+ * [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
+ * [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
+
+ ### how to train?
+ ```BASH
+ python codes/model_source/train_torch_sch.py \
+ --learning_rate=3e-06 \
+ --use_float_16=True \
+ --weight_decay=0.001 \
+ --base_save_ckpt_path=BASE_SAVE_CKPT_PATH \
+ --epochs=10 \
+ --batch_size=128 \
+ --model_type=MODEL_TYPE
+ ```
+ ### parameters
+ | parameter | type | description | default |
+ | ---------- | ---------- | ---------- | --------- |
+ | learning_rate | float | learning rate used for training | 5e-05 |
+ | use_float_16 | bool | whether to train with float16 | False |
+ | weight_decay | float | weight decay lambda | None |
+ | base_ckpt_save_path | str | base path where trained checkpoints are saved | None |
+ | epochs | int | total number of training epochs | 5 |
+ | batch_size | int | batch size used during training | 64 |
+ | model_type | int | selects which ELECTRA model to use for training | 0 |
+ ```
+ NOTE) The train and validation datasets can be set in the config section of train_torch_sch.py.
+ ```
+ </br>
+
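As a rough illustration of how the training parameters table could map onto a command-line parser, here is a minimal `argparse` sketch. It is not the parser in `train_torch_sch.py`: `build_parser` and `str2bool` are hypothetical names, and the flag names, types, and defaults are simply copied from the table. The checkpoint flag follows the table's spelling (`base_ckpt_save_path`), while the training command spells it `--base_save_ckpt_path`, so check the script for the name it actually accepts.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' strings; argparse's type=bool treats any non-empty string as True."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser: names, types and defaults mirror the parameters table.
    parser = argparse.ArgumentParser(description="fine-tuning options (sketch)")
    parser.add_argument("--learning_rate", type=float, default=5e-05)
    parser.add_argument("--use_float_16", type=str2bool, default=False)
    parser.add_argument("--weight_decay", type=float, default=None)
    parser.add_argument("--base_ckpt_save_path", type=str, default=None)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--model_type", type=int, default=0)
    return parser


# Parse the hyperparameters reported for the released models.
args = build_parser().parse_args(
    ["--learning_rate=3e-06", "--use_float_16=True", "--weight_decay=0.001",
     "--epochs=10", "--batch_size=128"]
)
print(args)
```

The explicit `str2bool` converter matters because `--use_float_16=False` would otherwise be read as `True`.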
+ ## How to use model?
+ ```PYTHON
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
+ tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
+ ```
+ </br>
+
70
+ ## Predict model
71
+ ์‚ฌ์šฉ์ž๊ฐ€ ํ…Œ์ŠคํŠธ ํ•ด๋ณด๊ณ  ์‹ถ์€ ๋ฌธ์žฅ์„ ๋„ฃ์–ด predict๋ฅผ ์ˆ˜ํ–‰ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
72
+ ```BASH
73
+ python codes/model_source/utils/predict.py \
74
+ --input_text=INPUT_TEXT \
75
+ --base_ckpt=BASE_CKPT
76
+ ```
77
+ ### parameters
78
+ | parameter | type | description | default |
79
+ | ---------- | ---------- | ---------- | --------- |
80
+ | input_text | str | user input text | "๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค. JminJ์ž…๋‹ˆ๋‹ค!" |
81
+ | base_ckpt | str | base path that saved trained checkpoints | False |
82
+ </br>
83
+
+ ## Model Valid Accuracy
+ | model | accuracy |
+ | ---------- | ---------- |
+ | kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
+ | tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
+ | koElectra_base_fp16_wd_custom_dataset | 0.8434 |
+ ```
+ Note)
+ All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
+ ```
+ </br>
+
+ ## Contact
+ * jminju254@gmail.com
+ </br></br>
+
+ ## Github
+ * https://github.com/JminJ/Bad_text_classifier
+ </br></br>
+
+ ## Reference
+ * [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
+ * [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
+ * [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
+ * [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
+ * [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
+ * [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)