---
language: ko
license: mit
tags:
- bart
---
# Model Card for kobart-base-v2
# Model Details
## Model Description
[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text and the model learns to reconstruct the original. Korean BART (hereafter **KoBART**) is a Korean `encoder-decoder` language model trained on more than **40GB** of Korean text using the `Text Infilling` noise function from the paper. The resulting `KoBART-base` model is released here.
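The `Text Infilling` noise mentioned above can be illustrated with a small sketch. In the BART paper, span lengths are drawn from a Poisson distribution with λ = 3 and each span is replaced by a single mask token, so the model also has to infer how many tokens are missing. The mask string, the masking budget, the way span starts are chosen, and all function names below are assumptions made for illustration; this is not the KoBART training code.

```python
import math
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Replace random token spans with a single MASK each (simplified sketch)."""
    rng = random.Random(seed)
    budget = round(len(tokens) * mask_ratio)  # max number of tokens to hide
    noisy, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(sample_poisson(rng, lam), budget, len(tokens) - i)
            noisy.append(MASK)  # one mask regardless of the span length
            i += span
            budget -= span
            if span == 0:
                # A zero-length span is a pure mask insertion: keep the token.
                noisy.append(tokens[i])
                i += 1
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy


example = "KoBART is trained to reconstruct the original text from noise".split()
print(text_infilling(example, seed=1))
```

During pretraining, the noisy sequence would be the encoder input and the original sequence the decoder target.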
- **Developed by:** More information needed
- **Shared by [Optional]:** Heewon(Haven) Jeon
- **Model type:** Feature Extraction
- **Language(s) (NLP):** Korean
- **License:** MIT
- **Parent Model:** BART
- **Resources for more information:**
  - [GitHub Repo](https://github.com/haven-jeon/KoBART)
  - [Model Demo Space](https://huggingface.co/spaces/gogamza/kobart-summarization)
# Uses
## Direct Use
This model can be used for the task of Feature Extraction.
## Downstream Use [Optional]
More information needed.
## Out-of-Scope Use
The model should not be used to intentionally create hostile or alienating environments for people.
# Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
## Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
# Training Details
## Training Data
| Data | # of Sentences |
|-------|---------------:|
| Korean Wiki | 5M |
| Other corpus | 0.27B |
In addition to Korean Wikipedia, a variety of data was used to train the model, including news, books, [Modu Corpus v1.0 (dialogue, news, ...)](https://corpus.korean.go.kr/), and [Blue House National Petitions](https://github.com/akngs/petitions).
The `vocab` size is 30,000. Emoticons and emoji that are common in conversation, such as those below, were added to improve recognition of these tokens.
> 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...
## Training Procedure
### Tokenizer
The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package.
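To make the idea behind character-level BPE concrete, the sketch below learns merge rules from a toy corpus in plain Python: it repeatedly finds the most frequent adjacent symbol pair and fuses it into one symbol. It is a simplified illustration, not the `tokenizers` implementation (for instance, it omits end-of-word suffix handling), and the function names and toy corpus are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split lines."""
    # Start from single characters: each word is a tuple of 1-char symbols.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


toy_corpus = ["low low low lower", "lowest low"]
merges = learn_bpe(toy_corpus, num_merges=2)
print(merges)
```

In a real tokenizer the number of merges is chosen so that the final vocabulary reaches the target size, which for this model is 30,000.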
### Speeds, Sizes, Times
| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|--------------|:----:|:-------:|--------:|--------:|--------:|--------------:|
| `KoBART-base` | 124M | Encoder | 6 | 16 | 3072 | 768 |
| | | Decoder | 6 | 16 | 3072 | 768 |
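As a sanity check on the table above, the parameter count can be estimated from the listed dimensions. The sketch below assumes a standard BART layout: token embeddings shared between encoder and decoder, learned position embeddings, biases on all linear layers, and one layer norm after each embedding block. The `max_positions` value of 1026 is an assumption, not a figure from this card; the vocabulary size of 30,000 comes from the Training Data section.

```python
def attention_params(d_model):
    """Query, key, value, and output projections, each with a bias vector."""
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model, ffn_dim):
    """Two linear layers (d_model -> ffn_dim -> d_model) with biases."""
    return (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)


def layer_norm_params(d_model):
    """One scale and one shift vector."""
    return 2 * d_model


def estimate_bart_params(vocab_size, d_model, ffn_dim, enc_layers, dec_layers,
                         max_positions):
    """Rough parameter estimate under the assumptions stated in the lead-in."""
    ln = layer_norm_params(d_model)
    # Encoder layer: self-attention + FFN, each followed by a layer norm.
    encoder_layer = attention_params(d_model) + ffn_params(d_model, ffn_dim) + 2 * ln
    # Decoder layer: self-attention + cross-attention + FFN, three layer norms.
    decoder_layer = 2 * attention_params(d_model) + ffn_params(d_model, ffn_dim) + 3 * ln
    embeddings = (
        vocab_size * d_model           # shared token embeddings
        + 2 * max_positions * d_model  # encoder and decoder position embeddings
        + 2 * ln                       # layer norm after each embedding block
    )
    return embeddings + enc_layers * encoder_layer + dec_layers * decoder_layer


total = estimate_bart_params(
    vocab_size=30_000, d_model=768, ffn_dim=3072,
    enc_layers=6, dec_layers=6,
    max_positions=1026,  # assumed value, not stated in this card
)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the 124M reported in the table. The number of attention heads does not enter the calculation, because heads only partition the existing projection matrices.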
# Evaluation
## Testing Data, Factors & Metrics
### Testing Data
More information needed
### Factors
More information needed
### Metrics
More information needed
## Results
NSMC
- Accuracy: 0.901
The model authors also note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
| | [NSMC](https://github.com/e9t/nsmc)(acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets)(spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair)(acc) |
|---|---|---|---|
| **KoBART-base** | 90.24 | 81.66 | 94.34 |
# Model Examination
More information needed
# Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
# Technical Specifications [optional]
## Model Architecture and Objective
More information needed
## Compute Infrastructure
More information needed
### Hardware
More information needed
### Software
More information needed.
# Citation
**BibTeX:**
More information needed.
# Glossary [optional]
More information needed
# More Information [optional]
More information needed
# Model Card Authors [optional]
Heewon(Haven) Jeon in collaboration with Ezi Ozoani and the Hugging Face team
# Model Card Contact
The model authors note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
Please post `KoBART`-related issues [here](https://github.com/SKT-AI/KoBART/issues).
# How to Get Started with the Model
Use the code below to get started with the model.
<details>
<summary> Click to expand </summary>
```python
from transformers import PreTrainedTokenizerFast, BartModel

# Load the KoBART tokenizer and the base encoder-decoder model from the Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartModel.from_pretrained('gogamza/kobart-base-v2')
```
</details>
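For feature extraction, the per-token vectors in the `last_hidden_state` returned by `BartModel` are often reduced to a single vector per sentence. The sketch below shows one common choice, mask-aware mean pooling, on plain Python lists so that it runs without downloading the model. This card does not prescribe a pooling method, so the choice of mean pooling, the function name, and the toy values are all assumptions.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1.

    hidden_states: list of token vectors (seq_len x hidden_size).
    attention_mask: list of 0/1 flags, one per token (0 marks padding).
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must align")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]


# Toy stand-in for one sequence of model outputs (3 tokens, hidden size 2).
# The last position is padding and must not affect the result.
toy_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = masked_mean_pool(toy_states, toy_mask)
print(sentence_vector)
```

With real model outputs the same reduction is usually done on tensors, using the tokenizer's `attention_mask` to exclude padding.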