
	---
	license: apache-2.0
	language:
	- ko
	pipeline_tag: text-classification
	---

	# formal_classifier
	A formality (honorific) classifier for Korean sentences.

	## Korean formal (존댓말) / informal (반말) speech classifier

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
	tokenizer = AutoTokenizer.from_pretrained('j5ng/kcbert-formal-classifier')

	formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
	# The input below is an informal (반말) sentence.
	# LABEL_0 = informal (반말), LABEL_1 = formal (존댓말)
	print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
	# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
	```

	***

	### Dataset sources

	#### Smilegate speech-style dataset (Korean SmileStyle Dataset)
	: https://github.com/smilegate-ai/korean_smile_style_dataset

	#### AI Hub Emotional Dialogue Corpus
	: https://www.aihub.or.kr/

	#### Dataset download (the AI Hub corpus can only be downloaded manually)
	```bash
	wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
	```
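A minimal sketch of how the downloaded TSV could be turned into labeled sentences. It assumes the file has `formal` and `informal` columns and that label 1 means formal and label 0 means informal, as in the example table of this README. The helper name `build_examples` is hypothetical, and this is not necessarily how the training data for this model was prepared.

```python
import csv
import io


def build_examples(tsv_file):
    """Turn SmileStyle-like rows into (sentence, label) pairs.

    Assumption: the TSV has `formal` and `informal` columns.
    Label 1 = formal (존댓말), label 0 = informal (반말).
    """
    examples = []
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column, label in (("informal", 0), ("formal", 1)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty cells, if any
                examples.append((sentence, label))
    return examples


# Tiny in-memory stand-in for smilestyle_dataset.tsv
sample = io.StringIO("formal\tinformal\n안녕하세요.\t안녕.\n\t뭐해?\n")
print(build_examples(sample))
```

For the real file, the same function could be called on `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")`.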

	### Development environment
	```bash
	Python3.9
	```

	```bash
	torch==1.13.1
	transformers==4.26.0
	pandas==1.5.3
	emoji==2.2.0
	soynlp==0.0.493
	datasets==2.10.1
	```


	#### Base model
	beomi/kcbert-base
	- GitHub : https://github.com/Beomi/KcBERT
	- HuggingFace : https://huggingface.co/beomi/kcbert-base
	***

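	### Examples
	Label 0 is informal speech (반말); label 1 is formal speech (존댓말).

	|sentence|English gloss|label|
	|------|------|---|
	|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even though I study hard, my grades don't reflect the effort|0|
	|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message to your son helps restore the relationship|1|
	|참 열심히 사신 보람이 있으시네요|All your hard work in life has really paid off|1|
	|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; I'll probably go to the UK starting this month|0|
	|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|It's hard because the division head keeps giving me work I can't do|0|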


	### Label distribution
	|label|train|test|
	|------|---|---|
	|0|133,430|34,908|
	|1|112,828|29,839|
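The counts in the table can be summarized with a few lines of arithmetic. The sketch below only restates the numbers above and derives the split sizes and the share of label 0 in each split; the variable names are illustrative.

```python
# Sentence counts copied from the distribution table (label -> count).
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}

totals = {split: sum(by_label.values()) for split, by_label in counts.items()}
grand_total = sum(totals.values())

for split, by_label in counts.items():
    share_of_data = totals[split] / grand_total   # size of this split
    label0_share = by_label[0] / totals[split]    # class balance inside it
    print(
        f"{split}: {totals[split]:,} sentences "
        f"({share_of_data:.1%} of all data), label 0 = {label0_share:.1%}"
    )
```

This works out to 246,258 training and 64,747 test sentences, roughly a 79/21 split, with label 0 making up about 54% of each split, so the two classes are fairly balanced.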

	***

	### Results
	```
	저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : Formal speech (존댓말). ( probability 99.19% )
	저번에 교수님께서 자료 가져오라했는데 기억나? : Informal speech (반말). ( probability 92.86% )
	```
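Lines like the ones above can be produced from the raw pipeline output with a small formatting helper. This is a sketch, not the author's script: `describe` and `LABEL_NAMES` are hypothetical names, and the mapping assumes `LABEL_0` is informal and `LABEL_1` is formal, as in the examples in this README. The prediction dict is hand-written so the snippet runs without downloading the model.

```python
# Assumed mapping from the model's generic labels to readable names.
LABEL_NAMES = {
    "LABEL_0": "informal (반말)",
    "LABEL_1": "formal (존댓말)",
}


def describe(sentence, prediction):
    """Format one text-classification result as 'sentence : label (probability)'."""
    name = LABEL_NAMES[prediction["label"]]
    return f"{sentence} : {name} (probability {prediction['score'] * 100:.2f}%)"


# Hand-written stand-in for one element of the list the pipeline returns.
fake_prediction = {"label": "LABEL_0", "score": 0.9286}
print(describe("저번에 교수님께서 자료 가져오라했는데 기억나?", fake_prediction))
```

With the real model, `prediction` would be the first element of the list returned by calling the `text-classification` pipeline on the sentence.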



	***

	## Citation
	```bibtex
	@misc{SmilegateAI2022KoreanSmileStyleDataset,
	title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
	author = {Seonghyun Kim},
	year = {2022},
	howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
	}
	```

	```bibtex
	@inproceedings{lee2020kcbert,
	title={KcBERT: Korean Comments BERT},
	author={Lee, Junbum},
	booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
	pages={437--440},
	year={2020}
	}
	```