hyunwoongko committed · Commit 8f9904d · Parent(s): 632aab2

Update README.md
- causal-lm
license: apache-2.0
datasets:
- Large-scale Korean dataset created by TUNiB.

---

# GPT-NeoX-Ko-1.3B

## Model Description
GPT-NeoX-Ko is a Korean autoregressive language model made by the EleutherAI multilingual team. We collected about 1.2TB of Korean data for this work, in collaboration with [TUNiB](https://tunib.ai/). In addition, we used the GPT-NeoX framework for model training and added several Korean tasks to LM-Evaluation-Harness for model evaluation.

| Hyperparameter       | Value                                                                                                                                   |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| \\(n_{parameters}\\) | 1,331,810,304                                                                                                                           |
| \\(n_{layers}\\)     | 24                                                                                                                                      |
| \\(d_{model}\\)      | 2048                                                                                                                                    |
| \\(d_{ff}\\)         | 8192                                                                                                                                    |
| \\(n_{heads}\\)      | 16                                                                                                                                      |
| \\(d_{head}\\)       | 128                                                                                                                                     |
| \\(n_{ctx}\\)        | 2048                                                                                                                                    |
| \\(n_{vocab}\\)      | 30,000 / 30,080                                                                                                                         |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)                                                                    |
| RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |

The model consists of 24 transformer layers with a model dimension of 2048 and a feedforward dimension of 8192. The model dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 30,000.
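
As a quick sanity check (not part of the original card), the short sketch below reads these values back from the hosted configuration; the attribute names assume the standard GPT-NeoX configuration class in `transformers`.

```python
# Minimal sketch: read the architecture hyperparameters from the hosted config.
# Attribute names assume the standard GPT-NeoX config class in `transformers`.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
print(config.num_hidden_layers)        # n_layers, expected 24
print(config.hidden_size)              # d_model, expected 2048
print(config.intermediate_size)        # d_ff, expected 8192
print(config.num_attention_heads)      # n_heads, expected 16
print(config.max_position_embeddings)  # n_ctx, expected 2048
print(config.vocab_size)               # n_vocab
```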

## Training data

GPT-NeoX-Ko was trained on a 1.2TB Korean dataset, a large-scale curated dataset created by [TUNiB](https://tunib.ai/).

## Training procedure

GPT-NeoX-Ko was trained for 213 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
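
To make the objective concrete, here is a hedged illustration (not the original training code) of the same next-token cross-entropy loss as exposed by `transformers`: passing the input ids as labels makes the model shift the targets by one position and return the mean loss of predicting each next token.

```python
# Illustration of the autoregressive objective, not the original training script:
# with labels=input_ids, the model shifts the targets internally and returns the
# mean cross-entropy of predicting each next token.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")

inputs = tokenizer("한국어 언어 모델의 예시 문장입니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # mean next-token cross-entropy for this sequence
```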

## How to use

This model can be easily loaded using the `AutoModelForCausalLM` functionality:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
```
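
Continuing from the snippet above, text can then be generated with the standard `generate` API. The prompt and decoding settings below are illustrative assumptions, not recommendations from the authors:

```python
# Hedged usage sketch: sampling a continuation with the standard generate() API.
# The prompt and decoding parameters are illustrative only.
input_ids = tokenizer("인공지능은", return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,  # length of the generated continuation
    do_sample=True,     # sample instead of greedy decoding
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```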

## Privacy considerations and Limitations

GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. The model is best at what it was pretrained for, however, which is generating text from a prompt.

### Privacy considerations

General training algorithms for pretrained language models risk memorizing personal information contained in the training data. To mitigate this, we added the following tokens to the vocabulary and replaced much of the personal information with these tokens during data preprocessing (a sketch of this replacement step follows the list):
* `<|acc|>` : bank account number
* `<|rrn|>` : resident registration number
* `<|tell|>` : phone number
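
The snippet below is a minimal, hypothetical sketch of such a preprocessing step; the helper name and the regular expressions are illustrative assumptions, not the patterns used in the actual pipeline.

```python
# Hypothetical illustration of the anonymization step described above.
# The regexes are simplified examples, not the actual preprocessing patterns.
import re

def mask_personal_info(text: str) -> str:
    # Order matters: the more specific patterns run before the looser account pattern.
    text = re.sub(r"\b\d{6}-\d{7}\b", "<|rrn|>", text)                  # resident registration numbers
    text = re.sub(r"\b01[016789]-?\d{3,4}-?\d{4}\b", "<|tell|>", text)  # mobile phone numbers
    text = re.sub(r"\b\d{2,4}-?\d{2,4}-?\d{4,8}\b", "<|acc|>", text)    # loose bank account pattern
    return text

print(mask_personal_info("문의는 010-1234-5678 로 연락주세요."))
# -> 문의는 <|tell|> 로 연락주세요.
```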

### Limitations and Biases

The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output. Depending upon the use case, GPT-NeoX-Ko may produce socially unacceptable text.

As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

## Evaluation results

We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks, for model evaluation.
We added the corresponding tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized the prompt templates described in the paper.
The following tables show the evaluation results with various numbers of few-shot examples. You can reproduce these results using the [multilingual-ko branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/multilingual-ko), as sketched below.
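
As a rough starting point, the sketch below launches such an evaluation through the harness's Python API. The `hf-causal` backend name and the `kobest_*` task identifiers are assumptions based on recent versions of lm-evaluation-harness; the exact interface and task names in the multilingual-ko branch may differ.

```python
# Hedged sketch of reproducing a KOBEST evaluation with lm-evaluation-harness.
# The backend and task names are assumptions from recent harness versions; the
# multilingual-ko branch may use different identifiers.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/gpt-neox-ko-1.3b",
    tasks=["kobest_boolq", "kobest_copa", "kobest_wic", "kobest_hellaswag", "kobest_sentineg"],
    num_fewshot=10,
)
print(results["results"])
```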

- the number of few-shot examples = 1

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) † | 1.2B | | | | | | |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) * | 6.0B | | | | | | |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.659 | 0.6993 | 0.6292 | 0.3884 | 0.8427 | 0.64372 |

- the number of few-shot examples = 5

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) † | 1.2B | | | | | | |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) * | 6.0B | | | | | | |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.6309 | 0.7053 | 0.656 | 0.3984 | 0.7979 | 0.6337 |

- the number of few-shot examples = 10

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) † | 1.2B | **0.6663** | 0.6222 | 0.656 | 0.4011 | 0.3534 | 0.5398 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) * | 6.0B | 0.3241 | 0.719 | 0.1356 | **0.4616** | 0.8056 | 0.48936 |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.5174 | **0.7072** | **0.6567** | 0.417 | **0.8444** | **0.5468** |

- the number of few-shot examples = 50

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) † | 1.2B | | | | | | |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) * | 6.0B | | | | | | |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.49 | 0.7097 | 0.5834 | 0.4416 | 0.7382 | 0.59258 |

- the number of few-shot examples = 100

| Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
|-------|------------|-------|------|-----|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) † | 1.2B | | | | | | |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) * | 6.0B | | | | | | |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.4867 | 0.7207 | 0.5877 | 0.5877 | 0.7407 | 0.59234 |

<p><strong>†</strong> The model card of this model provides evaluation results for the KOBEST dataset, but when we evaluated it with the prompts described in the paper, we could not obtain similar results. We then checked the KOBEST paper and found that the numbers reported in that model card are close to its fine-tuning results. Because we evaluate with prompt-based generation, without fine-tuning the model, our results may differ from those in that model card.</p>

<p><strong>*</strong> Since this model does not provide evaluation results for the KOBEST dataset, we evaluated it using lm-evaluation-harness ourselves. You can reproduce this result using the source code included in the multilingual-ko branch of lm-evaluation-harness.</p>

## Citation and Related Information

### BibTeX entry

If you find our work useful, please consider citing:

```bibtex
@misc{gpt-neox-ko,
  title = {{GPT-NeoX-Ko: Open-Source Korean Autoregressive Language Model}},
  author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Kim, Taekyun and Yang, Seungmu and Hyun, Jiwoong and Park, Sungho and Ryu, Myunghyun and Keum, Bitna and Oh, Saechan and Kim, Soohwan and Park, Kyubyong},
  url = {https://www.github.com/eleutherai/multilingual},
  month = {9},
  year = {2022},
}
```

### Acknowledgements

This project would not have been possible without the compute generously provided by [Stability.ai](https://stability.ai); we thank them for providing a large amount of GPU resources for this work. We also thank [TUNiB](https://tunib.ai) for providing a large-scale Korean dataset for this work.