---
language: ko
license: cc-by-nc-sa-4.0
tags:
- gpt2
---

# Model Card for kogpt2-base-v2

# Model Details

## Model Description

[GPT-2](https://openai.com/blog/better-language-models/) is a language model trained to predict the next word in a given text, and is optimized for sentence generation. `KoGPT2` is a Korean decoder (`decoder`) language model trained on more than 40GB of text to overcome GPT-2's limited performance on Korean.

- **Developed by:** SK Telecom
- **Shared by [Optional]:** SK Telecom
- **Model type:** Text Generation
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Parent Model:** GPT-2
- **Resources for more information:**
  - [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master)
  - [Model Demo Space](https://huggingface.co/spaces/gogamza/kogpt2-base-v2)

# Uses

## Direct Use

This model can be used for the task of text generation.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

The model authors also note in the [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master):

The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package. The vocabulary size is 51,200, and emoticons and emojis frequently used in conversation, such as the following, were added to improve the model's ability to recognize those tokens:

> 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...

In addition to [Korean Wikipedia](https://ko.wikipedia.org/), a variety of other data, such as news articles, [Modu Corpus v1.0](https://corpus.korean.go.kr/), and [Blue House National Petitions](https://github.com/akngs/petitions), were used to train the model.

## Training Procedure

### Preprocessing

More information needed.

### Speeds, Sizes, Times

| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|------------------|:----:|:-------:|--------:|--------:|--------:|--------:|
| `kogpt2-base-v2` | 125M | Decoder | 12 | 12 | 3072 | 768 |

(These dimensions can be cross-checked against the published config; see the sketch at the end of the Technical Specifications section below.)

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

More information needed.

### Factors

More information needed.

### Metrics

More information needed.

## Results

### Classification or Regression

|  | [NSMC](https://github.com/e9t/nsmc) (acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (Spearman) |
|---|---|---|
| **KoGPT2 2.0** | 89.1 | 77.8 |

# Model Examination

More information needed.

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

More information needed.

## Compute Infrastructure

More information needed.

### Hardware

More information needed.

### Software

More information needed.
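The architecture figures in the Speeds, Sizes, Times table above can be cross-checked against the configuration published on the Hugging Face Hub. Below is a minimal sketch, assuming network access to the Hub; the expected values in the comments come from the table, not from independent measurement:

```python
from transformers import AutoConfig

# Fetch only the published config; no model weights are downloaded.
config = AutoConfig.from_pretrained("skt/kogpt2-base-v2")

print(config.n_layer)  # expected: 12 transformer layers
print(config.n_head)   # expected: 12 attention heads
print(config.n_embd)   # expected: 768 hidden dimensions
# GPT-2 configs leave n_inner as None, which means 4 * n_embd for the FFN.
print(config.n_inner or 4 * config.n_embd)  # expected: 3072
```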
# Citation

**BibTeX:**

More information needed.

# Glossary [optional]

More information needed.

# More Information [optional]

More information needed.

# Model Card Authors [optional]

SK Telecom in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

The model authors also note in the [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master):

> Please post `KoGPT2`-related issues [here](https://github.com/SKT-AI/KoGPT2/issues).

# How to Get Started with the Model

Use the code below to get started with the model.
<details>
<summary>Click to expand</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
model = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")
```

</details>
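Once loaded, the model can generate Korean text. The sketch below is illustrative rather than an official recipe from the model authors; the prompt and the sampling parameters (`max_length`, `top_k`, `top_p`) are arbitrary choices:

```python
# Continuing from the loading code above.
prompt = "근육이 커지기 위해서는"  # "In order for muscles to grow," (example prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sampling settings here are illustrative, not author-recommended values.
outputs = model.generate(
    input_ids,
    max_length=64,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # avoids a padding warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# The vocabulary also covers the conversational emoticons/emojis
# mentioned in the Training Data section.
print(tokenizer.tokenize("좋아요 😁"))
```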