---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---

# formal_classifier
A formal (honorific) speech classifier for Korean text.

## Korean honorific (jondaetmal) vs. casual (banmal) speech classifier

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

# Casual-speech input: "The professor told us to bring the materials last time, remember?"
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```

***

### Dataset sources

#### Smilegate speech-style dataset (Korean SmileStyle Dataset)
 : https://github.com/smilegate-ai/korean_smile_style_dataset

#### AI Hub emotional dialogue corpus
 : https://www.aihub.or.kr/

#### Dataset download (AI Hub data can only be downloaded manually)
 ```bash
 wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
 ```
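Once the file is downloaded, it can be turned into labelled sentences. The sketch below is one possible way to do that using only the standard library. It assumes the TSV has columns named `formal` and `informal`, which is an assumption about the file layout and should be checked against the actual header; `iter_labeled_rows` is a hypothetical helper. Labels follow the convention in this card's example table (0 = casual, 1 = honorific).

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def iter_labeled_rows(
    handle: TextIO,
    formal_col: str = "formal",      # assumed column name
    informal_col: str = "informal",  # assumed column name
) -> Iterator[Tuple[str, int]]:
    """Yield (sentence, label) pairs: 1 for honorific, 0 for casual."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for column, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty cells
                yield sentence, label


# Tiny in-memory stand-in for the downloaded file.
sample = io.StringIO("formal\tinformal\n안녕하세요\t안녕\n")
print(list(iter_labeled_rows(sample)))
# [('안녕하세요', 1), ('안녕', 0)]
```

For the real file, pass a handle opened with `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.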
 
### Development environment
```text
Python 3.9
```

```text
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
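To check whether an environment matches these pins, something like the following sketch could be used. `check_pins` and `installed_version` are hypothetical helpers, not part of this project; they rely on `importlib.metadata` from the standard library.

```python
from importlib import metadata
from typing import Callable, Dict

# Versions copied from the list above.
PINS = {
    "torch": "1.13.1",
    "transformers": "4.26.0",
    "pandas": "1.5.3",
    "emoji": "2.2.0",
    "soynlp": "0.0.493",
    "datasets": "2.10.1",
}


def installed_version(name: str) -> str:
    """Return the installed version, or an empty string if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return ""


def check_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, str]:
    """Map each mismatching package to a short description of the mismatch."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            problems[name] = f"wanted {wanted}, found {found or 'nothing'}"
    return problems


for name, problem in check_pins(PINS).items():
    print(f"{name}: {problem}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific `torch` wheel) would be reported as a mismatch even if it is compatible.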
 
 
#### Model used
beomi/kcbert-base
  - GitHub : https://github.com/Beomi/KcBERT
  - HuggingFace : https://huggingface.co/beomi/kcbert-base
***

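### Examples
|sentence|English gloss|label|
|------|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|No matter how hard I study, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps mend the relationship|1|
|참 열심히 사신 보람이 있으시네요|All your years of hard work have truly paid off|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too, and it looks like I'm off to the UK starting this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|The division head keeps giving me work I can't do, and it's wearing me down|0|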


### Label distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
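
As a quick check of class balance, the counts in the table can be turned into per-split totals and label shares. This is only a small arithmetic sketch; `label_shares` is a hypothetical helper.

```python
# Counts copied from the distribution table.
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}


def label_shares(split_counts):
    """Return each label's fraction of the split's total."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


for split, split_counts in counts.items():
    total = sum(split_counts.values())
    shares = label_shares(split_counts)
    pretty = {label: f"{share:.1%}" for label, share in shares.items()}
    print(split, total, pretty)
```

Both splits come out at roughly 54% label 0 and 46% label 1, so the classes are fairly balanced.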

***

### Results
```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : honorific speech (jondaetmal). ( probability 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : casual speech (banmal). ( probability 92.86% )
```
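
Lines in this style can be built from the pipeline's raw output. The sketch below is one way to do it and is not necessarily the script used for this card: `describe` is a hypothetical helper, and the mapping of `LABEL_0` to casual and `LABEL_1` to honorific is inferred from the example table rather than read from the model config.

```python
LABEL_NAMES = {
    "LABEL_0": "casual speech (banmal)",         # assumed mapping
    "LABEL_1": "honorific speech (jondaetmal)",  # assumed mapping
}


def describe(sentence: str, prediction: dict) -> str:
    """Format one pipeline prediction as a readable line."""
    # Fall back to the raw label if it is not in the mapping.
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{sentence} : {name}. ( probability {prediction['score']:.2%} )"


# A dict in the shape the text-classification pipeline returns per input.
example = {"label": "LABEL_0", "score": 0.9286}
print(describe("저번에 교수님께서 자료 가져오라했는데 기억나?", example))
```

With a loaded pipeline, `describe(text, formal_classifier(text)[0])` would format the top prediction for `text`.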



***

## Citation
```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```

```bibtex
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
```