---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---

# formal_classifier

A formality classifier (also called an honorific classifier) for Korean text.

## Korean formal (존댓말) and informal (반말) speech classifier

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

# Informal (반말) input: "The professor told us to bring the materials last time, remember?"
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
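The pipeline reports generic `LABEL_0` / `LABEL_1` names. Judging from the example table and results in this card, `LABEL_0` appears to correspond to informal speech (반말) and `LABEL_1` to formal speech (존댓말). The following is a minimal sketch of a helper that turns one prediction into a readable string. The `LABEL_NAMES` mapping and the `describe` function are illustrative names introduced here, not part of the model or of `transformers`.

```python
# Assumed mapping, inferred from this card's example table and results;
# it is not read from the model config.
LABEL_NAMES = {
    "LABEL_0": "informal (반말)",
    "LABEL_1": "formal (존댓말)",
}


def describe(prediction: dict) -> str:
    """Format one pipeline prediction such as {'label': 'LABEL_0', 'score': 0.99}."""
    # Fall back to the raw label if it is not in the assumed mapping.
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name}, probability {prediction['score']:.2%}"


# This dict is the output printed by the usage example above.
sample = {"label": "LABEL_0", "score": 0.9999139308929443}
print(describe(sample))
```

With the pipeline loaded, `describe(formal_classifier(text)[0])` would give the same kind of string for any input sentence.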

***

### Dataset sources

#### Smilegate speech-style dataset (Korean SmileStyle Dataset)
: https://github.com/smilegate-ai/korean_smile_style_dataset

#### AI Hub emotional dialogue corpus
: https://www.aihub.or.kr/

#### Downloading the datasets (the AI Hub corpus can only be downloaded manually from the site)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
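Once the file is downloaded, it has to be turned into labeled sentences. The card does not describe that step, so the following is only a sketch of one plausible approach. It assumes the TSV has a header row with `formal` and `informal` columns and that some cells may be empty; `build_examples` is a hypothetical helper name. The sample rows reuse the two sentences from this card rather than real dataset content.

```python
import csv
import io


def build_examples(tsv_file, formal_col="formal", informal_col="informal"):
    """Return (sentence, label) pairs: 1 for formal, 0 for informal.

    The column names are assumptions about the TSV header.
    """
    examples = []
    # QUOTE_NONE keeps quotation marks inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty or missing cells
                examples.append((sentence, label))
    return examples


# Tiny in-memory stand-in for the downloaded file.
sample_tsv = io.StringIO(
    "formal\tinformal\n"
    "저번에 교수님께서 자료 가져오라하셨는데 기억나세요?\t"
    "저번에 교수님께서 자료 가져오라했는데 기억나?\n"
)
examples = build_examples(sample_tsv)
print(examples)
```

For the real file, the same function could be called on `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")`.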

### Development environment

```bash
Python 3.9
```

```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
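To confirm that a local environment matches these pins, a small check with the standard library's `importlib.metadata` can help. This is a sketch, not a script shipped with the model; `PINNED` and `check_pins` are names made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
PINNED = {
    "torch": "1.13.1",
    "transformers": "4.26.0",
    "pandas": "1.5.3",
    "emoji": "2.2.0",
    "soynlp": "0.0.493",
    "datasets": "2.10.1",
}


def check_pins(pins):
    """Map each package to (wanted, installed, matches); installed is None if absent."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Exact string comparison: a local build tag such as "+cu117" counts as a mismatch.
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "mismatch"
    print(f"{name}: wanted {wanted}, installed {installed} ({status})")
```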

#### Base model
beomi/kcbert-base
- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base

***

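### Examples

Label 0 marks informal speech (반말) and label 1 marks formal speech (존댓말).

|sentence|English gloss|label|
|------|---|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps mend the relationship|1|
|참 열심히 사신 보람이 있으시네요|All your hard work in life has really paid off|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; I'll probably be going to the UK from this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|The division head keeps giving me work I can't do, so I'm struggling|0|
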
### Label distribution

Number of sentences per label in each split:

|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
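The split sizes and class balance follow directly from the table above. A short sketch of the arithmetic, with the counts copied from the table (`COUNTS` is just a local name for this example):

```python
# Sentence counts copied from the table above: label -> (train, test).
COUNTS = {0: (133_430, 34_908), 1: (112_828, 29_839)}

train_total = sum(train for train, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())
train_fraction = train_total / (train_total + test_total)

print(f"train={train_total:,} test={test_total:,} train fraction={train_fraction:.1%}")
for label, (train, test) in COUNTS.items():
    # Share of each label within its split.
    print(f"label {label}: {train / train_total:.1%} of train, {test / test_total:.1%} of test")
```

This works out to roughly an 80/20 train/test split, with label 0 in a slight majority (about 54%) in both splits.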

***

### Results

The same question phrased in formal and in informal speech:

```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : formal speech (존댓말). (probability 99.19%)
저번에 교수님께서 자료 가져오라했는데 기억나? : informal speech (반말). (probability 92.86%)
```
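The card does not say how these percentages were computed. For a two-class classification head, the usual approach is a softmax over the two output logits, so the sketch below shows that calculation under that assumption. The logit values are made up for illustration and are not actual model outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [informal, formal].
logits = [-2.0, 2.8]
probs = softmax(logits)
winner = max(range(len(probs)), key=probs.__getitem__)
print(f"predicted label {winner} with probability {probs[winner]:.2%}")
```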

***

## Citation

```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```

```bibtex
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
```