---
language: ko
tags:
- text-2-text-generation
---

# Model Card for Bert base model for Korean

# Model Details

## Model Description

More information needed.

- **Developed by:** Kiyoung Kim
- **Shared by [Optional]:** Kiyoung Kim
- **Model type:** Text2Text Generation
- **Language(s) (NLP):** Korean
- **License:** More information needed
- **Parent Model:** bert-base-multilingual-uncased
- **Resources for more information:**
  - [GitHub Repo](https://github.com/kiyoungkim1/LM-kor)

# Uses

## Direct Use

This model can be used for the task of text2text generation.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

* A 70 GB Korean text dataset and 42,000 lower-cased subwords are used.

The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor):

> The following data were used for training:
>
> 1. 100 million reviews from major Korean e-commerce sites plus 20 million blog-style web pages (75 GB)
> 2. Modu Corpus (모두의 말뭉치) (18 GB)
> 3. Korean Wikipedia and Namuwiki (6 GB)
>
> Unnecessary or overly short sentences and duplicates were removed, so that 70 GB (about 12.7 billion tokens) of the original 100 GB of text was finally used for training. The data is organized into categories such as cosmetics (8 GB), food (6 GB), electronics (13 GB), and pets (2 GB), which were also used to train domain-specific language models.

## Training Procedure

### Preprocessing

The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor):

> Whole-word masking was applied to the BERT model.
>
> Characters other than Korean, English, numbers, and some special characters (e.g., Chinese characters and emoji) were judged to hinder training and were removed.
>
> 40,000 subwords were generated with the WordPiece model from [Huggingface tokenizers](https://github.com/huggingface/tokenizers). 2,000 unused tokens were added on top of these for training; the unused tokens are reserved for domain-specific terms.
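As a rough illustration of that vocabulary setup, the sketch below trains a WordPiece vocabulary with the Hugging Face `tokenizers` library the authors mention. It is a minimal sketch only: the corpus path, the lowercasing and accent settings, and the way the 2,000 `[unused]` slots are reserved are assumptions for illustration, not the authors' published training script.

```python
# Minimal sketch: build a ~42,000-entry lower-cased WordPiece vocabulary with
# 2,000 reserved [unusedN] slots, roughly matching the description above.
import os

from tokenizers import BertWordPieceTokenizer

os.makedirs("tokenizer-out", exist_ok=True)

# strip_accents=False keeps Hangul syllables composed instead of splitting them into jamo.
tokenizer = BertWordPieceTokenizer(lowercase=True, strip_accents=False)

tokenizer.train(
    files=["corpus.txt"],  # hypothetical path to the cleaned Korean corpus
    vocab_size=42000,      # ~40,000 learned subwords plus the reserved tokens below
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    + [f"[unused{i}]" for i in range(2000)],  # 2,000 placeholder slots for domain-specific terms
)

tokenizer.save_model("tokenizer-out")  # writes vocab.txt, loadable with BertTokenizerFast
```

The resulting `vocab.txt` can be loaded with `BertTokenizerFast`, and the `[unusedN]` entries can later be replaced with domain-specific terms without resizing the model's embedding matrix.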
### Speeds, Sizes, Times

More information needed.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

More information needed.

### Factors

More information needed.

### Metrics

More information needed.

## Results

* Check the model performance and comparisons with other Korean language models in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor).

|                   | **NSMC**<br>(acc) | **Naver NER**<br>(F1) | **PAWS**<br>(acc) | **KorNLI**<br>(acc) | **KorSTS**<br>(spearman) | **Question Pair**<br>(acc) | **Korean-Hate-Speech (Dev)**<br>(F1) |
| :---------------- | :---------------: | :-------------------: | :---------------: | :-----------------: | :----------------------: | :------------------------: | :----------------------------------: |
| kcbert-base       | 89.87             | 85.00                 | 67.40             | 75.57               | 75.94                    | 93.93                      | **68.78**                            |
| **OURS**          |                   |                       |                   |                     |                          |                            |                                      |
| **bert-kor-base** | 90.87             | 87.27                 | 82.80             | 82.32               | 84.31                    | 95.25                      | 68.45                                |

# Model Examination

More information needed.

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

More information needed.

## Compute Infrastructure

More information needed.

### Hardware

More information needed.

### Software

More information needed.

# Citation

**BibTeX:**

```bibtex
@misc{kim2020lmkor,
  author       = {Kiyoung Kim},
  title        = {Pretrained Language Models For Korean},
  year         = {2020},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/kiyoungkim1/LMkor}}
}
```

# Glossary [optional]

More information needed.

# More Information [optional]

* Cloud TPUs were provided by the [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc/) program.
* The [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/) was also used as pretraining data.

# Model Card Authors [optional]

Kiyoung Kim, in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

More information needed.

# How to Get Started with the Model

Use the code below to get started with the model.
```python
# PyTorch only: EncoderDecoderModel is the PyTorch class in transformers.
from transformers import BertTokenizerFast, EncoderDecoderModel

# Load the shared BERT encoder-decoder checkpoint and its tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")
```
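Once loaded, the encoder-decoder pair can be driven with the standard `generate` API. The snippet below is an illustrative sketch only: the Korean input sentence and the decoding parameters are placeholders, and the pretrained checkpoint generally needs task-specific fine-tuning (e.g., on a summarization or paraphrase dataset) before the generated text is useful.

```python
# Hypothetical generation example; input text and decoding settings are
# placeholders, not recommendations from the model authors.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")

text = "이 모델은 한국어 텍스트로 사전학습되었습니다."  # placeholder Korean input
inputs = tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    decoder_start_token_id=tokenizer.cls_token_id,  # BERT-style decoders start from [CLS]
    max_length=64,
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```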