---
language:
  - ko
tags:
  - pytorch
  - causal-lm
license: apache-2.0
---

# GPT-NeoX-Ko-1.3B

## Model Description

GPT-NeoX-Ko is a Korean autoregressive language model made by the EleutherAI multilingual team. We collected about 1.2TB of Korean data for this work, in collaboration with TUNiB. In addition, we used the GPT-NeoX framework for model training and added several Korean tasks to LM-Evaluation-Harness for model evaluation.

| Hyperparameter | Value |
|----------------|-------|
| $n_{parameters}$ | 1,331,810,304 |
| $n_{layers}$ | 24 |
| $d_{model}$ | 2048 |
| $d_{ff}$ | 8192 |
| $n_{heads}$ | 16 |
| $d_{head}$ | 128 |
| $n_{ctx}$ | 2048 |
| $n_{vocab}$ | 30,000 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |

The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 30000.
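
These hyperparameters can be read back from the published configuration as a quick sanity check; the sketch below assumes the standard field names that transformers uses for GPT-NeoX checkpoints.

```python
from transformers import AutoConfig

# Load the configuration from the Hub and print the architecture
# hyperparameters listed in the table above.
config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
print(config.num_hidden_layers)        # transformer layers
print(config.hidden_size)              # model dimension
print(config.intermediate_size)        # feedforward dimension
print(config.num_attention_heads)      # attention heads
print(config.max_position_embeddings)  # context length
print(config.vocab_size)               # embedding vocabulary size
```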

## Training data

GPT-NeoX-Ko was trained on a 1.2TB Korean dataset, a large-scale curated dataset created by TUNiB.

## Training procedure

GPT-NeoX-Ko was trained for 213 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
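
Concretely, the training objective is the standard next-token cross-entropy: at every position the model predicts the following token given all previous tokens. As an illustration (a minimal sketch, not the actual training code), the same loss can be computed with transformers by passing the input ids as labels, since the library shifts the labels internally:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")

# Passing the input ids as labels makes the model return the shifted
# next-token cross-entropy loss (average negative log-likelihood per token).
inputs = tokenizer("제주도는 한국에서 가장 큰 섬입니다.",  # "Jeju Island is the largest island in Korea."
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```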

## How to use

This model can be easily loaded using the AutoModelForCausalLM functionality:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
```
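
Once the tokenizer and model are loaded, generation works as with any other causal language model in transformers. The prompt and sampling settings below are purely illustrative:

```python
# Encode a Korean prompt and sample a continuation.
prompt = "한국의 수도는"  # "The capital of Korea is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```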

## Privacy considerations and Limitations

GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for, which is generating text from a prompt.

### Privacy considerations

General training algorithms for pretrained language models carry the risk of memorizing personal information from the training data. To mitigate this privacy problem, we added the following tokens to the vocabulary and replaced much of the personal information with these tokens during the data preprocessing steps:

* `<|acc|>` : bank account number
* `<|rrn|>` : resident registration number
* `<|tell|>` : phone number
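
These placeholder tokens appear verbatim in the training text and may therefore show up in generated output. As a quick check (a minimal sketch, assuming the tokenizer loaded in the usage example above), you can verify that the tokenizer maps each of them to an entry in its vocabulary:

```python
# Check that each privacy placeholder maps to a real vocabulary entry
# (i.e. not the unknown-token id).
for token in ["<|acc|>", "<|rrn|>", "<|tell|>"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(token, token_id, token_id != tokenizer.unk_token_id)
```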

### Limitations and Biases

The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output. Depending upon the use case, GPT-NeoX-Ko may produce socially unacceptable text.

As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

## Evaluation results

We used the KOBEST dataset, which consists of five Korean downstream tasks, for model evaluation. We added the corresponding tasks to lm-evaluation-harness and used the prompt templates described in the paper. The following tables show the evaluation results with varying numbers of few-shot examples. You can reproduce these results using the multilingual-ko branch of lm-evaluation-harness.

* the number of few-shot examples = 1

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| skt/ko-gpt-trinity-1.2B-v0.5 † | 1.2B | | | | | | |
| kakaobrain/kogpt * | 6.0B | | | | | | |
| EleutherAI/gpt-neox-ko-1.3b (ours) | 1.3B | 0.659 | 0.6993 | 0.6292 | 0.3884 | 0.8427 | 0.64372 |

* the number of few-shot examples = 5

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| skt/ko-gpt-trinity-1.2B-v0.5 † | 1.2B | | | | | | |
| kakaobrain/kogpt * | 6.0B | | | | | | |
| EleutherAI/gpt-neox-ko-1.3b (ours) | 1.3B | 0.6309 | 0.7053 | 0.656 | 0.3984 | 0.7979 | 0.6337 |

* the number of few-shot examples = 10

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| skt/ko-gpt-trinity-1.2B-v0.5 † | 1.2B | 0.6663 | 0.6222 | 0.656 | 0.4011 | 0.3534 | 0.5398 |
| kakaobrain/kogpt * | 6.0B | 0.3241 | 0.719 | 0.1356 | 0.4616 | 0.8056 | 0.48936 |
| EleutherAI/gpt-neox-ko-1.3b (ours) | 1.3B | 0.5174 | 0.7072 | 0.6567 | 0.417 | 0.8444 | 0.5468 |

* the number of few-shot examples = 50

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| skt/ko-gpt-trinity-1.2B-v0.5 † | 1.2B | | | | | | |
| kakaobrain/kogpt * | 6.0B | | | | | | |
| EleutherAI/gpt-neox-ko-1.3b (ours) | 1.3B | 0.49 | 0.7097 | 0.5834 | 0.4416 | 0.7382 | 0.59258 |

* the number of few-shot examples = 100

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| skt/ko-gpt-trinity-1.2B-v0.5 † | 1.2B | | | | | | |
| kakaobrain/kogpt * | 6.0B | | | | | | |
| EleutherAI/gpt-neox-ko-1.3b (ours) | 1.3B | 0.4867 | 0.7207 | 0.5877 | 0.5877 | 0.7407 | 0.59234 |

† The model card of this model provides evaluation results for the KOBEST dataset, but when we evaluated the model with the prompts described in the paper, we could not obtain similar results. We therefore checked the KOBEST paper and found that the reported numbers were close to the fine-tuning results in the paper. Because we evaluated by prompt-based generation without fine-tuning the model, our results may differ from those provided by that model card.

* Since this model does not provide evaluation results for the KOBEST dataset, we evaluated it ourselves using lm-evaluation-harness. You can reproduce this result using the source code included in the multilingual-ko branch of lm-evaluation-harness.
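
For reference, a reproduction attempt might look like the sketch below, which uses the Python API of 2022-era lm-evaluation-harness. The KOBEST task names and the exact API surface of the multilingual-ko branch are assumptions here, so treat this as a rough guide and consult that branch's documentation for the authoritative invocation.

```python
# Hypothetical reproduction sketch; the "kobest_*" task names and the "gpt2"
# adapter name are assumptions about the multilingual-ko branch, not confirmed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",  # Hugging Face causal-LM adapter in older lm-evaluation-harness
    model_args="pretrained=EleutherAI/gpt-neox-ko-1.3b",
    tasks=["kobest_boolq", "kobest_copa", "kobest_wic",
           "kobest_hellaswag", "kobest_sentineg"],
    num_fewshot=10,
)
print(results["results"])
```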

## Citation and Related Information

### BibTeX entry

If you find our work useful, please consider citing:

```bibtex
@misc{gpt-neox-ko,
  title = {{GPT-NeoX-Ko: Open-Source Korean Autoregressive Language Model}},
  author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Kim, Taekyun and Yang, Seungmu and Hyun, Jiwoong and Park, Sungho and Ryu, Myunghyun and Keum, Bitna and Oh, Saechan and Kim, Soohwan and Park, Kyubyong},
  url = {https://www.github.com/eleutherai/multilingual},
  month = {9},
  year = {2022},
}
```

## Acknowledgements

This project would not have been possible without the compute generously provided by Stability.ai; we thank them for providing a large amount of GPU resources. We also thank TUNiB for providing a large-scale Korean dataset for this work.