gogamza and nazneen committed on
Commit f9f2ec3 • 1 Parent(s): d9a1f64

model documentation (#1)

- model documentation (9f62a16f664fd459acfa480cdcef14070a0c64ce)


Co-authored-by: Nazneen Rajani <nazneen@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +177 -9
README.md CHANGED
@@ -1,23 +1,191 @@
  ---
  language: ko
  tags:
  - bart
- license: mit
  ---
 
- ## KoBART-base-v2
 
- With the addition of chatting data, the model is trained to handle the semantics of sequences longer than KoBART.
 
- ```python
- from transformers import PreTrainedTokenizerFast, BartModel
 
- tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
- model = BartModel.from_pretrained('gogamza/kobart-base-v2')
- ```
 
- ### Performance
 
  NSMC
  - acc. : 0.901
  ---
  language: ko
+ license: mit
  tags:
  - bart
  ---
 
+ # Model Card for kobart-base-v2
+
+ # Model Details
+
+ ## Model Description
+
+ [**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder` that adds noise to part of the input text and learns to reconstruct the original text. Korean BART (**KoBART**) is a Korean `encoder-decoder` language model trained on more than **40GB** of Korean text using the `Text Infilling` noise function from the paper. The resulting `KoBART-base` model is released here.
+
+ - **Developed by:** More information needed
+ - **Shared by [Optional]:** Heewon (Haven) Jeon
+ - **Model type:** Feature Extraction
+ - **Language(s) (NLP):** Korean
+ - **License:** MIT
+ - **Parent Model:** BART
+ - **Resources for more information:**
+   - [GitHub Repo](https://github.com/haven-jeon/KoBART)
+   - [Model Demo Space](https://huggingface.co/spaces/gogamza/kobart-summarization)
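+
+ As a rough illustration of the Text Infilling objective described above, the sketch below masks a span and lets the model reconstruct it (the `<mask>` token string and the generation settings are assumptions for illustration, not the authors' training code):
+
+ ```python
+ from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+ model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')
+
+ # Corrupt the input by masking a span, then decode a reconstruction.
+ inputs = tokenizer('한국어 BART는 <mask> 언어 모델입니다.', return_tensors='pt')
+ output_ids = model.generate(**inputs, max_length=32, num_beams=4)
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```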
+
+ # Uses
+
+ ## Direct Use
+
+ This model can be used for the task of Feature Extraction.
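+
+ For instance, a minimal feature-extraction sketch (the example sentence and the use of the final hidden states are illustrative assumptions):
+
+ ```python
+ import torch
+ from transformers import PreTrainedTokenizerFast, BartModel
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+ model = BartModel.from_pretrained('gogamza/kobart-base-v2')
+
+ inputs = tokenizer('안녕하세요.', return_tensors='pt')
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # One 768-dimensional contextual vector per input token.
+ print(outputs.last_hidden_state.shape)
+ ```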
+
+ ## Downstream Use [Optional]
+
+ More information needed.
+
+ ## Out-of-Scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ # Training Details
+
+ ## Training Data
+
+ | Data | # of Sentences |
+ |--------------|---------------:|
+ | Korean Wiki | 5M |
+ | Other corpus | 0.27B |
+
+ In addition to Korean Wikipedia, a variety of data, including news, books, [Modu Corpus v1.0 (dialogue, news, ...)](https://corpus.korean.go.kr/), and [Blue House National Petitions](https://github.com/akngs/petitions), was used to train the model.
+
+ The `vocab` size is 30,000, and emoticons and emojis that are frequently used in conversation, such as the ones below, were added to improve the model's ability to recognize those tokens:
+ > 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...
+
+ ## Training Procedure
+
+ ### Tokenizer
+
+ The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package.
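+
+ A quick way to inspect the tokenizer from Python (the sample sentences are illustrative; actual subword splits may differ):
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+
+ print(tokenizer.vocab_size)                      # 30,000 per the training notes above
+ print(tokenizer.tokenize('안녕하세요. 한국어 BART 입니다.'))
+ print(tokenizer.tokenize('반가워요 😀'))          # emojis listed above are in the vocab
+ ```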
+
+ ### Speeds, Sizes, Times
+
+ | Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
+ |---------------|:-----------:|:-------:|------------:|-----------:|--------:|------------:|
+ | `KoBART-base` | 124M | Encoder | 6 | 16 | 3072 | 768 |
+ | | | Decoder | 6 | 16 | 3072 | 768 |
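+
+ A quick sanity check of the parameter count in the table (a rough sketch; it simply sums all weights in the checkpoint):
+
+ ```python
+ from transformers import BartModel
+
+ model = BartModel.from_pretrained('gogamza/kobart-base-v2')
+ print(sum(p.numel() for p in model.parameters()))  # roughly 124M, per the table above
+ ```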
+
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ ### Testing Data
+
+ More information needed
+
+ ### Factors
+
+ More information needed
+
+ ### Metrics
+
+ More information needed
+
+ ## Results
+
  NSMC
  - acc. : 0.901
 
+ The model authors also note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
+
+ | | [NSMC](https://github.com/e9t/nsmc) (acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair) (acc) |
+ |---|---|---|---|
+ | **KoBART-base** | 90.24 | 81.66 | 94.34 |
+
+ # Model Examination
+
+ More information needed
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** More information needed
+ - **Hours used:** More information needed
+ - **Cloud Provider:** More information needed
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications [optional]
+
+ ## Model Architecture and Objective
+
+ More information needed
+
+ ## Compute Infrastructure
+
+ More information needed
+
+ ### Hardware
+
+ More information needed
+
+ ### Software
+
+ More information needed.
+
+ # Citation
+
+ **BibTeX:**
+
+ More information needed.
+
+ # Glossary [optional]
+
+ More information needed
+
+ # More Information [optional]
+
+ More information needed
+
+ # Model Card Authors [optional]
+
+ Heewon (Haven) Jeon, in collaboration with Ezi Ozoani and the Hugging Face team
+
+ # Model Card Contact
+
+ The model authors note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
+ Please post `KoBART`-related issues [here](https://github.com/SKT-AI/KoBART/issues).
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ from transformers import PreTrainedTokenizerFast, BartModel
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+ model = BartModel.from_pretrained('gogamza/kobart-base-v2')
+ ```
+
+ </details>