---
language: ko
license: mit
tags:
- bart
---

# Model Card for kobart-base-v2

# Model Details

## Model Description

[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text, and the model learns to reconstruct the original text. Korean BART (hereafter **KoBART**) is a Korean `encoder-decoder` language model trained on more than **40GB** of Korean text using the `Text Infilling` noise function from the paper; a toy sketch of this noising scheme follows the list below. The resulting `KoBART-base` is released here.

- **Developed by:** More information needed
- **Shared by [Optional]:** Heewon (Haven) Jeon
- **Model type:** Feature Extraction
- **Language(s) (NLP):** Korean
- **License:** MIT
- **Parent Model:** BART
- **Resources for more information:**
  - [GitHub Repo](https://github.com/haven-jeon/KoBART)
  - [Model Demo Space](https://huggingface.co/spaces/gogamza/kobart-summarization)
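To make the `Text Infilling` objective concrete, here is a minimal sketch of the corruption step: a contiguous span of tokens is replaced with a single mask token, and the model is trained to recover the original sequence. The span sampling and the `<mask>` string here are illustrative assumptions, not the exact preprocessing used to train KoBART.

```python
import random

def text_infilling(tokens, mask_token="<mask>", max_span=3):
    """Toy Text Infilling corruption: replace one random contiguous
    span of tokens with a single mask token. The BART paper samples
    span lengths from a Poisson distribution and can mask multiple
    spans; this sketch keeps only the core idea."""
    if len(tokens) < 2:
        return list(tokens)
    span_len = random.randint(1, min(max_span, len(tokens) - 1))
    start = random.randint(0, len(tokens) - span_len)
    # The denoising objective: reconstruct `tokens` from this corrupted input.
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

original = ["한국어", "텍스트", "를", "복원", "합니다"]
print(text_infilling(original))  # e.g. ['한국어', '<mask>', '복원', '합니다']
```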
# Uses

## Direct Use

This model can be used for the task of Feature Extraction.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

| Data | # of Sentences |
|--------------|---------------:|
| Korean Wiki | 5M |
| Other corpus | 0.27B |

In addition to Korean Wikipedia, a variety of other data was used to train the model, including news articles, books, [모두의 말뭉치 v1.0](https://corpus.korean.go.kr/) (dialogue, news, ...), and [청와대 국민청원](https://github.com/akngs/petitions) (Blue House national petitions).

The `vocab` size is 30,000; emoticons and emoji frequently used in dialogue, such as those below, were added to improve the model's ability to recognize those tokens.

> 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...

## Training Procedure

### Tokenizer

The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package.
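As a quick check of the tokenizer's behavior, the following usage sketch loads it from the Hub and segments a Korean sentence containing one of the added emoji; the example sentence and its segmentation are illustrative, not taken from the authors' documentation.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')

# Character BPE segmentation of a Korean sentence with an added emoji token
print(tokenizer.tokenize("안녕하세요. 반가워요. 😀"))
print(tokenizer.vocab_size)  # 30,000-token vocabulary
```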
### Speeds, Sizes, Times

| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|---------------|:-----------:|:-------:|------------:|-----------:|--------:|------------:|
| `KoBART-base` | 124M | Encoder | 6 | 16 | 3072 | 768 |
| | | Decoder | 6 | 16 | 3072 | 768 |

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

More information needed

### Factors

More information needed

### Metrics

More information needed

## Results

NSMC (acc): 0.901

The model authors also note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):

| | [NSMC](https://github.com/e9t/nsmc) (acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair) (acc) |
|---|---|---|---|
| **KoBART-base** | 90.24 | 81.66 | 94.34 |

# Model Examination

More information needed

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

More information needed

## Compute Infrastructure

More information needed

### Hardware

More information needed

### Software

More information needed.

# Citation

**BibTeX:**

More information needed.

# Glossary [optional]

More information needed

# More Information [optional]

More information needed

# Model Card Authors [optional]

Heewon (Haven) Jeon in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

The model authors note in the [GitHub Repo](https://github.com/haven-jeon/KoBART): please post `KoBART`-related issues [here](https://github.com/SKT-AI/KoBART/issues).

# How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import PreTrainedTokenizerFast, BartModel

# Load the KoBART tokenizer and encoder-decoder model from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartModel.from_pretrained('gogamza/kobart-base-v2')
```
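Beyond loading, a minimal feature-extraction sketch follows; the input sentence is arbitrary, and the printed shape assumes the 768-dimensional hidden size listed above.

```python
import torch

# Encode a sample sentence and run a forward pass for feature extraction
input_ids = tokenizer("안녕하세요. 한국어 BART 입니다.", return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model(input_ids)

# Last-layer hidden states: one 768-dimensional vector per token
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```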