---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---
# formal_classifier
formal classifier or honorific classifier
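## Korean Formal/Informal Speech Classifier
Some time ago, I introduced a simple method for classifying Korean formal speech (jondaetmal) and informal speech (banmal) with a Korean morphological analyzer.<br>
However, when I tried to apply that method in practice, it produced errors in many cases.
For example: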
```bash
저번에 교수님께서 자료 가져오라했는데 기억나?
```
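This sentence ("Do you remember the professor told us to bring the materials?") was often misjudged as formal speech as a whole, because it contains the honorific particle "께서" (kkeseo), which honors the professor rather than the listener.<br>
So this time I built a deep learning model, and I would like to share the process.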
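To make the failure mode concrete, here is a minimal sketch of a marker-based rule. It is not the morphological-analyzer method mentioned above, only a simplified stand-in; `HONORIFIC_MARKERS` and `naive_is_formal` are hypothetical names, and the marker list is an assumption chosen for illustration.

```python
# Hypothetical, simplified stand-in for a rule-based approach:
# a sentence is called formal if it contains any honorific marker.
HONORIFIC_MARKERS = ("께서", "습니다", "세요")  # illustrative list, not exhaustive


def naive_is_formal(sentence: str) -> bool:
    """Return True if any honorific marker appears anywhere in the sentence."""
    return any(marker in sentence for marker in HONORIFIC_MARKERS)


# The speaker addresses the listener casually, but "께서" honors the professor
# being talked about, so the rule wrongly flags the whole sentence as formal.
sentence = "저번에 교수님께서 자료 가져오라했는데 기억나?"
print(naive_is_formal(sentence))  # True, although the sentence is informal
```

A rule like this cannot tell who the honorific is directed at, which is the kind of error that motivates a learned classifier.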
#### If you just want to use the model right away, the code below works as is.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained('j5ng/kcbert-formal-classifier')
formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
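The pipeline returns generic label names. Judging from the labeled examples in this README, label 0 is informal speech and label 1 is formal speech, so `LABEL_0` above would mean informal. A small helper can make the output easier to read; `LABEL_NAMES` and `readable` are hypothetical names, and the mapping is an assumption inferred from the data description rather than read from the model's config.

```python
# Assumed mapping, inferred from the dataset labels (0 = informal, 1 = formal).
LABEL_NAMES = {"LABEL_0": "informal", "LABEL_1": "formal"}


def readable(predictions):
    """Convert pipeline output such as [{'label': 'LABEL_0', 'score': 0.99}]
    into (name, score) tuples with the score rounded to 4 decimals."""
    return [(LABEL_NAMES.get(p["label"], p["label"]), round(p["score"], 4))
            for p in predictions]


# Reuses the output shown above instead of calling the model again.
print(readable([{"label": "LABEL_0", "score": 0.9999139308929443}]))
# [('informal', 0.9999)]
```

Unknown label names fall through unchanged, so the helper stays safe if the model's labels differ from this assumption.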
***
### Dataset sources
#### Smilegate speech style dataset (Korean SmileStyle Dataset)
: https://github.com/smilegate-ai/korean_smile_style_dataset
#### AI Hub emotional dialogue corpus
: https://www.aihub.or.kr/
#### Dataset download (the AI Hub data can only be downloaded manually)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
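Once downloaded, the TSV can be turned into labeled sentences. The sketch below assumes the file has a header row containing `formal` and `informal` columns (alongside other style columns) and that empty cells should be skipped; check the actual header before relying on it. `to_labeled_rows` is a hypothetical helper, not part of this repository, and `get_train_data.py` may work differently.

```python
import csv
import io


def to_labeled_rows(tsv_file, formal_col="formal", informal_col="informal"):
    """Yield (sentence, label) pairs: label 1 for formal, 0 for informal.

    The column names are assumptions about the dataset header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column, label in ((informal_col, 0), (formal_col, 1)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty cells
                yield sentence, label


# Tiny in-memory stand-in for the real file, for demonstration only.
sample = io.StringIO("formal\tinformal\n안녕하세요\t안녕\n\t뭐해?\n")
print(list(to_labeled_rows(sample)))
```

With the real file, an object from `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")` could be passed in place of the `StringIO` sample.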
### Development environment
```bash
Python3.9
```
```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
#### Base model
beomi/kcbert-base
- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base
***
## Data
```bash
get_train_data.py
```
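### Examples
Label 0 is informal speech (banmal) and label 1 is formal speech (jondaetmal).

|sentence|English gloss|label|
|------|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps mend the relationship|1|
|참 열심히 사신 보람이 있으시네요|Living so diligently has truly paid off for you|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; looks like I'm going to the UK starting this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|It's hard because the division head keeps giving me work I can't do|0|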
### Distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
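
The class balance follows directly from the table. As a quick check, the snippet below computes the share of label 0 in each split and the train fraction from the counts above; the variable names are illustrative.

```python
# Counts copied from the distribution table above.
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}

for split, by_label in counts.items():
    total = sum(by_label.values())
    share_label_0 = by_label[0] / total
    print(f"{split}: {total:,} sentences, label 0 share = {share_label_0:.1%}")

train_total = sum(counts["train"].values())
test_total = sum(counts["test"].values())
print(f"train fraction = {train_total / (train_total + test_total):.1%}")
```

Both splits come out at roughly 54% label 0, and the train split holds about 79% of all sentences, so the two classes are fairly balanced.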
***
## Training (train)
```bash
python3 modeling/train.py
```
***
## Prediction (inference)
```bash
python3 inference.py
```
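```python
# Methods of the classifier class (note `self`); `predict` is defined elsewhere.
def formal_percentage(self, text):
    # Probability of the formal class (index 1), rounded to two decimals
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result > 0.5:
        # "존댓말입니다" = "This is formal speech"; "확률" = "probability"
        print(f'{text} : 존댓말입니다. ( 확률 {result*100}% )')
    if result < 0.5:
        # "반말입니다" = "This is informal speech"
        print(f'{text} : 반말입니다. ( 확률 {((1 - result)*100)}% )')
```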
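Result
```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )
```
Both sentences ask "Do you remember the professor told us to bring the materials?". The first uses polite endings and is classified as formal (99.19%); the second uses casual endings and is classified as informal (92.86%).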
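
`formal_percentage` reads index 1 of the first prediction row, which suggests that `predict` returns per-class probabilities with the formal class at index 1. The `predict` method is not shown here, so the following is only a sketch of one common way to obtain such probabilities: applying a softmax to the two output logits. The logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [informal, formal]; not real model output.
probs = softmax([-1.2, 3.4])
formal_prob = round(probs[1], 2)
print("formal" if formal_prob > 0.5 else "informal", formal_prob)
```

Because the two probabilities sum to 1, reporting `1 - result` for the informal case, as `print_message` does, is consistent with this reading.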
***
## Citation
```bash
@misc{SmilegateAI2022KoreanSmileStyleDataset,
title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
author = {Seonghyun Kim},
year = {2022},
howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```
```bash
@inproceedings{lee2020kcbert,
title={KcBERT: Korean Comments BERT},
author={Lee, Junbum},
booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
pages={437--440},
year={2020}
}
```