---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---
# formal_classifier
A formality (honorific) classifier for Korean sentences, built on `beomi/kcbert-base`.
## Korean formal (존댓말) vs. informal (반말) speech classifier
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")
formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

# LABEL_0 = informal speech (반말), LABEL_1 = formal speech (존댓말),
# matching the 0/1 labels in the examples table.
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
***
### Dataset sources
#### Smilegate speech-style dataset (Korean SmileStyle Dataset)
: https://github.com/smilegate-ai/korean_smile_style_dataset
#### AI Hub emotional dialogue corpus (감성 대화 말뭉치)
: https://www.aihub.or.kr/
#### Dataset download (AI Hub data can only be downloaded directly from the site)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
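Once the TSV is downloaded, it can be turned into (sentence, label) pairs. The sketch below is an assumption, not the author's preprocessing: it assumes the file header has columns named `formal` and `informal`, and it maps them to labels 1 and 0 to match the examples in this README. The helper name `rows_to_examples` is hypothetical.

```python
import csv
import io

def rows_to_examples(tsv_file, formal_col="formal", informal_col="informal"):
    """Yield (sentence, label) pairs; label 1 = formal, 0 = informal.

    The column names are assumptions about the TSV header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        for column, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty or missing cells
                yield sentence, label

# Tiny in-memory stand-in for smilestyle_dataset.tsv (made-up rows).
sample = io.StringIO(
    "formal\tinformal\n"
    "기억나세요?\t기억나?\n"
    "\t뭐해?\n"
)
examples = list(rows_to_examples(sample))
print(examples)
```

For the real file, pass `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.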
### Development environment
```bash
Python 3.9
```
```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
#### Base model
beomi/kcbert-base
- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base
***
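### Examples
Label `0` is informal speech (반말); label `1` is formal speech (존댓말).

|sentence|English gloss|label|
|------|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps mend the relationship|1|
|참 열심히 사신 보람이 있으시네요|Living so diligently has clearly paid off for you|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too. Looks like I'll be going to the UK starting this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|The division head keeps giving me work I can't handle, and it's hard on me|0|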
### Label distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
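To read the class balance off the table, the counts can be converted into shares. This is a small arithmetic sketch using only the numbers above; the names `counts` and `label_shares` are illustrative.

```python
# Counts copied from the distribution table above.
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}

def label_shares(split_counts):
    """Return each label's fraction of the split total."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}

totals = {split: sum(c.values()) for split, c in counts.items()}
shares = {split: label_shares(c) for split, c in counts.items()}
test_fraction = totals["test"] / sum(totals.values())

for split in counts:
    rounded = {label: round(share, 3) for label, share in shares[split].items()}
    print(split, totals[split], rounded)
print("test fraction:", round(test_fraction, 3))
```

Both splits are roughly 54% label `0`, and the test split is about 21% of all sentences.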
***
### Results
Each line shows the input sentence, the verdict (`존댓말입니다` = "this is formal speech", `반말입니다` = "this is informal speech"), and the predicted probability (`확률`).
```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )
```
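The result lines above pair each input with a verdict and a probability. A minimal sketch of how a pipeline result could be rendered in that form, assuming `LABEL_0` is informal (반말) and `LABEL_1` is formal (존댓말) as in the examples table. `LABEL_NAMES` and `format_prediction` are hypothetical names, not part of the model or of `transformers`, and the prediction dict is a hand-written stand-in so the snippet runs without downloading the model.

```python
# Assumed mapping from pipeline labels to Korean class names.
LABEL_NAMES = {"LABEL_0": "반말", "LABEL_1": "존댓말"}

def format_prediction(sentence, prediction):
    """Render one pipeline result dict as '<sentence> : <class>입니다. ( 확률 xx.xx% )'."""
    name = LABEL_NAMES[prediction["label"]]
    percent = prediction["score"] * 100
    return f"{sentence} : {name}입니다. ( 확률 {percent:.2f}% )"

# Stand-in for formal_classifier(sentence)[0]; the score is a made-up value.
fake_prediction = {"label": "LABEL_0", "score": 0.9286}
line = format_prediction("기억나?", fake_prediction)
print(line)
```

With the real model, `formal_classifier(sentence)[0]` would take the place of `fake_prediction`.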
***
## Citation
```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
author = {Seonghyun Kim},
year = {2022},
howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```
```bibtex
@inproceedings{lee2020kcbert,
title={KcBERT: Korean Comments BERT},
author={Lee, Junbum},
booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
pages={437--440},
year={2020}
}
```