KYUNGHYUN9 committed
Commit c0f3f26
1 Parent(s): 8a663e6

Upload 11 files

README.md ADDED
@@ -0,0 +1,200 @@
---
language:
- ko
tags:
- pytorch
- causal-lm
license: apache-2.0
---

# Polyglot-Ko-1.3B

## Model Description
Polyglot-Ko is a series of large-scale Korean autoregressive language models made by the EleutherAI polyglot team.

| Hyperparameter       | Value |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| \\(n_{parameters}\\) | 1,331,810,304 |
| \\(n_{layers}\\)     | 24 |
| \\(d_{model}\\)      | 2,048 |
| \\(d_{ff}\\)         | 8,192 |
| \\(n_{heads}\\)      | 16 |
| \\(d_{head}\\)       | 128 |
| \\(n_{ctx}\\)        | 2,048 |
| \\(n_{vocab}\\)      | 30,003 / 30,080 |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
| RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |

The model consists of 24 transformer layers with a model dimension of 2,048 and a feedforward dimension of 8,192. The model dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model was trained with a tokenization vocabulary of 30,003.

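As a quick sanity check, the table above can be compared against the published model configuration. The snippet below is a minimal sketch; it assumes the `transformers` library is installed and the Hugging Face Hub is reachable:

```python
from transformers import AutoConfig

# Fetch the configuration from the Hub and compare it with the hyperparameter table.
config = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-1.3b")

print(config.num_hidden_layers)        # 24    -> n_layers
print(config.hidden_size)              # 2048  -> d_model
print(config.intermediate_size)        # 8192  -> d_ff
print(config.num_attention_heads)      # 16    -> n_heads
print(config.max_position_embeddings)  # 2048  -> n_ctx
print(config.vocab_size)               # 30080 -> padded n_vocab (the tokenizer itself has 30,003 entries)
```
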
## Training data

Polyglot-Ko-1.3B was trained on 863 GB of Korean language data (1.2 TB before processing), a large-scale dataset curated by [TUNiB](https://tunib.ai/). The data collection process complied with South Korean laws. This dataset was collected for the purpose of training Polyglot-Ko models, so it will not be released for public use.

| Source                              | Size (GB) | Link                             |
|-------------------------------------|-----------|----------------------------------|
| Korean blog posts                   | 682.3     | -                                |
| Korean news dataset                 | 87.0      | -                                |
| Modu corpus                         | 26.4      | corpus.korean.go.kr              |
| Korean patent dataset               | 19.0      | -                                |
| Korean Q & A dataset                | 18.1      | -                                |
| KcBert dataset                      | 12.7      | github.com/Beomi/KcBERT          |
| Korean fiction dataset              | 6.1       | -                                |
| Korean online comments              | 4.2       | -                                |
| Korean wikipedia                    | 1.4       | ko.wikipedia.org                 |
| Clova call                          | < 1.0     | github.com/clovaai/ClovaCall     |
| Naver sentiment movie corpus        | < 1.0     | github.com/e9t/nsmc              |
| Korean hate speech dataset          | < 1.0     | -                                |
| Open subtitles                      | < 1.0     | opus.nlpl.eu/OpenSubtitles.php   |
| AIHub various tasks datasets        | < 1.0     | aihub.or.kr                      |
| Standard Korean language dictionary | < 1.0     | stdict.korean.go.kr/main/main.do |

Furthermore, to avoid the model memorizing and generating personally identifiable information (PII) from the training data, we masked out the following kinds of sensitive information in the pre-processing stage:

* `<|acc|>` : bank account number
* `<|rrn|>` : resident registration number
* `<|tel|>` : phone number

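The exact preprocessing code has not been released, so the sketch below is purely illustrative of this kind of token-level masking; the patterns are assumptions made for illustration, not the ones used in the actual pipeline:

```python
import re

# Illustrative patterns only; the real preprocessing pipeline and its patterns are not public.
PHONE_RE = re.compile(r"01[016789]-?\d{3,4}-?\d{4}")  # Korean mobile phone numbers
RRN_RE = re.compile(r"\d{6}-?[1-4]\d{6}")             # resident registration numbers

def mask_pii(text: str) -> str:
    """Replace detected PII spans with the special tokens used during pre-training."""
    text = PHONE_RE.sub("<|tel|>", text)
    text = RRN_RE.sub("<|rrn|>", text)
    return text

print(mask_pii("연락처는 010-1234-5678입니다."))
# -> 연락처는 <|tel|>입니다.
```
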
## Training procedure
Polyglot-Ko-1.3B was trained on 213 billion tokens over 102,000 steps on 256 A100 GPUs with the [GPT-NeoX framework](https://github.com/EleutherAI/gpt-neox). It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token.

## How to use

This model can be easily loaded using the `AutoModelForCausalLM` class:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
```

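Once loaded, the model generates text through the standard `generate` API. The following is a minimal sketch; the prompt and sampling settings are illustrative rather than recommended defaults:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model.eval()

# An illustrative Korean prompt ("Seoul is the capital of the Republic of Korea, and").
prompt = "서울은 대한민국의 수도이며,"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
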
## Evaluation results

We evaluate Polyglot-Ko-1.3B on the [KOBEST dataset](https://arxiv.org/abs/2204.04541), a benchmark with 5 downstream tasks, against comparable models such as skt/ko-gpt-trinity-1.2B-v0.5, kakaobrain/kogpt and facebook/xglm-7.5B, using the prompts provided in the paper.

The following tables show the results for different numbers of few-shot examples. You can reproduce these results using the [polyglot branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) and the script below. For a fair comparison, all models were run under the same conditions and with the same prompts. In the tables, `n` refers to the number of few-shot examples.

In the case of the WiC dataset, all models perform close to random chance.

```console
python main.py \
   --model gpt2 \
   --model_args pretrained='EleutherAI/polyglot-ko-1.3b' \
   --tasks kobest_copa,kobest_hellaswag,kobest_boolq,kobest_sentineg,kobest_wic \
   --num_fewshot $YOUR_NUM_FEWSHOT \
   --batch_size $YOUR_BATCH_SIZE \
   --device $YOUR_DEVICE \
   --output_path /path/to/output/
```

### COPA (F1)

| Model | params | 0-shot | 5-shot | 10-shot | 50-shot |
|----------------------------------------------------------------------------------------------|--------|--------|--------|---------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 1.2B | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 6.0B | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
| [facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 7.5B | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
| **[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) (this)** | **1.3B** | **0.7196** | **0.7193** | **0.7204** | **0.7206** |
| [EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 3.8B | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
| [EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 5.8B | 0.7745 | 0.7676 | 0.7775 | 0.7887 |
| [EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 12.8B | 0.7937 | 0.8108 | 0.8037 | 0.8369 |

<img src="https://github.com/EleutherAI/polyglot/assets/19511788/d5b49364-aed5-4467-bae2-5a322c8e2ceb" width="800px">

### HellaSwag (F1)

| Model | params | 0-shot | 5-shot | 10-shot | 50-shot |
|----------------------------------------------------------------------------------------------|--------|--------|--------|---------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 1.2B | 0.5243 | 0.5272 | 0.5166 | 0.5352 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 6.0B | 0.5590 | 0.5833 | 0.5828 | 0.5907 |
| [facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 7.5B | 0.5665 | 0.5689 | 0.5565 | 0.5622 |
| **[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) (this)** | **1.3B** | **0.5247** | **0.5260** | **0.5278** | **0.5427** |
| [EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 3.8B | 0.5707 | 0.5830 | 0.5670 | 0.5787 |
| [EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 5.8B | 0.5976 | 0.5998 | 0.5979 | 0.6208 |
| [EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 12.8B | 0.5954 | 0.6306 | 0.6098 | 0.6118 |

<img src="https://github.com/EleutherAI/polyglot/assets/19511788/5acb60ac-161a-4ab3-a296-db4442e08b7f" width="800px">

### BoolQ (F1)

| Model | params | 0-shot | 5-shot | 10-shot | 50-shot |
|----------------------------------------------------------------------------------------------|--------|--------|--------|---------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 1.2B | 0.3356 | 0.4014 | 0.3640 | 0.3560 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 6.0B | 0.4514 | 0.5981 | 0.5499 | 0.5202 |
| [facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 7.5B | 0.4464 | 0.3324 | 0.3324 | 0.3324 |
| **[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) (this)** | **1.3B** | **0.3552** | **0.4751** | **0.4109** | **0.4038** |
| [EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 3.8B | 0.4320 | 0.5263 | 0.4930 | 0.4038 |
| [EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 5.8B | 0.4356 | 0.5698 | 0.5187 | 0.5236 |
| [EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 12.8B | 0.4818 | 0.6041 | 0.6289 | 0.6448 |

<img src="https://github.com/EleutherAI/polyglot/assets/19511788/b74c23c0-01f3-4b68-9e10-a48e9aa052ab" width="800px">

### SentiNeg (F1)

| Model | params | 0-shot | 5-shot | 10-shot | 50-shot |
|----------------------------------------------------------------------------------------------|--------|--------|--------|---------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 1.2B | 0.6065 | 0.6878 | 0.7280 | 0.8413 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 6.0B | 0.3747 | 0.8942 | 0.9294 | 0.9698 |
| [facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 7.5B | 0.3578 | 0.4471 | 0.3964 | 0.5271 |
| **[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) (this)** | **1.3B** | **0.6790** | **0.6257** | **0.5514** | **0.7851** |
| [EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 3.8B | 0.4858 | 0.7950 | 0.7320 | 0.7851 |
| [EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 5.8B | 0.3394 | 0.8841 | 0.8808 | 0.9521 |
| [EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 12.8B | 0.9117 | 0.9015 | 0.9345 | 0.9723 |

<img src="https://github.com/EleutherAI/polyglot/assets/19511788/95b56b19-d349-4b70-9ff9-94a5560f89ee" width="800px">

### WiC (F1)

| Model | params | 0-shot | 5-shot | 10-shot | 50-shot |
|----------------------------------------------------------------------------------------------|--------|--------|--------|---------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 1.2B | 0.3290 | 0.4313 | 0.4001 | 0.3621 |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 6.0B | 0.3526 | 0.4775 | 0.4358 | 0.4061 |
| [facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 7.5B | 0.3280 | 0.4903 | 0.4945 | 0.3656 |
| **[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) (this)** | **1.3B** | **0.3297** | **0.4850** | **0.4650** | **0.3290** |
| [EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 3.8B | 0.3390 | 0.4944 | 0.4203 | 0.3835 |
| [EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 5.8B | 0.3913 | 0.4688 | 0.4189 | 0.3910 |
| [EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 12.8B | 0.3985 | 0.3683 | 0.3307 | 0.3273 |

<img src="https://github.com/EleutherAI/polyglot/assets/19511788/4de4a4c3-d7ac-4e04-8b0c-0d533fe88294" width="800px">

## Limitations and Biases

Polyglot-Ko has been trained to optimize next-token prediction. Language models such as this are often used for a wide variety of tasks, and it is important to be aware of possible unexpected outcomes. For instance, Polyglot-Ko will not always return the most factual or accurate response, but rather the most statistically likely one. In addition, Polyglot-Ko may produce socially unacceptable or offensive content. We recommend having a human curator or other filtering mechanism in place to censor sensitive content.

## Citation and Related Information
### BibTeX entry
If you find our work useful, please consider citing:
```bibtex
@misc{ko2023technical,
  title={A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models},
  author={Hyunwoong Ko and Kichang Yang and Minho Ryu and Taekyoon Choi and Seungmu Yang and Jiwung Hyun and Sungho Park},
  year={2023},
  eprint={2306.02254},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Licensing
All our models are licensed under the terms of the Apache License 2.0.

```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

### Acknowledgement

This project was made possible thanks to the computing resources from [Stability.ai](https://stability.ai), and thanks to [TUNiB](https://tunib.ai) for providing a large-scale Korean dataset for this work.

config.json ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "./polyglot-ko-1.3b/",
  "architectures": [
    "GPTNeoXForCausalLM"
  ],
  "bos_token_id": 0,
  "classifier_dropout": 0.1,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neox",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "rotary_emb_base": 10000,
  "rotary_pct": 0.5,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.29.2",
  "use_cache": true,
  "use_parallel_residual": true,
  "vocab_size": 30080
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "transformers_version": "4.29.2"
}
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fc6dcf159c11ab442b1ce00c85124a4e13d735c1540661e90b273cac0b438c4a
size 1000292202
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd1929594672146d268055fa2c322e312c883bab7fb8c1bc33efb9878d541dae
size 1015555724
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:75f223f820f5ba69867063bd4e0b6c5277e5bc3e0313a26f686e31d355179a6e
size 748480810
model.safetensors.index.json ADDED
@@ -0,0 +1,371 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 2676206640.0
4
+ },
5
+ "weight_map": {
6
+ "embed_out.weight": "model-00003-of-00003.safetensors",
7
+ "gpt_neox.embed_in.weight": "model-00001-of-00003.safetensors",
8
+ "gpt_neox.final_layer_norm.bias": "model-00003-of-00003.safetensors",
9
+ "gpt_neox.final_layer_norm.weight": "model-00003-of-00003.safetensors",
10
+ "gpt_neox.layers.0.attention.bias": "model-00001-of-00003.safetensors",
11
+ "gpt_neox.layers.0.attention.dense.bias": "model-00001-of-00003.safetensors",
12
+ "gpt_neox.layers.0.attention.dense.weight": "model-00001-of-00003.safetensors",
13
+ "gpt_neox.layers.0.attention.masked_bias": "model-00001-of-00003.safetensors",
14
+ "gpt_neox.layers.0.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
15
+ "gpt_neox.layers.0.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
16
+ "gpt_neox.layers.0.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
17
+ "gpt_neox.layers.0.input_layernorm.bias": "model-00001-of-00003.safetensors",
18
+ "gpt_neox.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
19
+ "gpt_neox.layers.0.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
20
+ "gpt_neox.layers.0.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
21
+ "gpt_neox.layers.0.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
22
+ "gpt_neox.layers.0.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
23
+ "gpt_neox.layers.0.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
24
+ "gpt_neox.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
25
+ "gpt_neox.layers.1.attention.bias": "model-00001-of-00003.safetensors",
26
+ "gpt_neox.layers.1.attention.dense.bias": "model-00001-of-00003.safetensors",
27
+ "gpt_neox.layers.1.attention.dense.weight": "model-00001-of-00003.safetensors",
28
+ "gpt_neox.layers.1.attention.masked_bias": "model-00001-of-00003.safetensors",
29
+ "gpt_neox.layers.1.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
30
+ "gpt_neox.layers.1.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
31
+ "gpt_neox.layers.1.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
32
+ "gpt_neox.layers.1.input_layernorm.bias": "model-00001-of-00003.safetensors",
33
+ "gpt_neox.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
34
+ "gpt_neox.layers.1.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
35
+ "gpt_neox.layers.1.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
36
+ "gpt_neox.layers.1.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
37
+ "gpt_neox.layers.1.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
38
+ "gpt_neox.layers.1.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
39
+ "gpt_neox.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
40
+ "gpt_neox.layers.10.attention.bias": "model-00002-of-00003.safetensors",
41
+ "gpt_neox.layers.10.attention.dense.bias": "model-00002-of-00003.safetensors",
42
+ "gpt_neox.layers.10.attention.dense.weight": "model-00002-of-00003.safetensors",
43
+ "gpt_neox.layers.10.attention.masked_bias": "model-00002-of-00003.safetensors",
44
+ "gpt_neox.layers.10.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
45
+ "gpt_neox.layers.10.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
46
+ "gpt_neox.layers.10.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
47
+ "gpt_neox.layers.10.input_layernorm.bias": "model-00002-of-00003.safetensors",
48
+ "gpt_neox.layers.10.input_layernorm.weight": "model-00002-of-00003.safetensors",
49
+ "gpt_neox.layers.10.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
50
+ "gpt_neox.layers.10.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
51
+ "gpt_neox.layers.10.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
52
+ "gpt_neox.layers.10.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
53
+ "gpt_neox.layers.10.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
54
+ "gpt_neox.layers.10.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
55
+ "gpt_neox.layers.11.attention.bias": "model-00002-of-00003.safetensors",
56
+ "gpt_neox.layers.11.attention.dense.bias": "model-00002-of-00003.safetensors",
57
+ "gpt_neox.layers.11.attention.dense.weight": "model-00002-of-00003.safetensors",
58
+ "gpt_neox.layers.11.attention.masked_bias": "model-00002-of-00003.safetensors",
59
+ "gpt_neox.layers.11.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
60
+ "gpt_neox.layers.11.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
61
+ "gpt_neox.layers.11.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
62
+ "gpt_neox.layers.11.input_layernorm.bias": "model-00002-of-00003.safetensors",
63
+ "gpt_neox.layers.11.input_layernorm.weight": "model-00002-of-00003.safetensors",
64
+ "gpt_neox.layers.11.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
65
+ "gpt_neox.layers.11.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
66
+ "gpt_neox.layers.11.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
67
+ "gpt_neox.layers.11.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
68
+ "gpt_neox.layers.11.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
69
+ "gpt_neox.layers.11.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
70
+ "gpt_neox.layers.12.attention.bias": "model-00002-of-00003.safetensors",
71
+ "gpt_neox.layers.12.attention.dense.bias": "model-00002-of-00003.safetensors",
72
+ "gpt_neox.layers.12.attention.dense.weight": "model-00002-of-00003.safetensors",
73
+ "gpt_neox.layers.12.attention.masked_bias": "model-00002-of-00003.safetensors",
74
+ "gpt_neox.layers.12.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
75
+ "gpt_neox.layers.12.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
76
+ "gpt_neox.layers.12.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
77
+ "gpt_neox.layers.12.input_layernorm.bias": "model-00002-of-00003.safetensors",
78
+ "gpt_neox.layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
79
+ "gpt_neox.layers.12.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
80
+ "gpt_neox.layers.12.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
81
+ "gpt_neox.layers.12.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
82
+ "gpt_neox.layers.12.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
83
+ "gpt_neox.layers.12.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
84
+ "gpt_neox.layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
85
+ "gpt_neox.layers.13.attention.bias": "model-00002-of-00003.safetensors",
86
+ "gpt_neox.layers.13.attention.dense.bias": "model-00002-of-00003.safetensors",
87
+ "gpt_neox.layers.13.attention.dense.weight": "model-00002-of-00003.safetensors",
88
+ "gpt_neox.layers.13.attention.masked_bias": "model-00002-of-00003.safetensors",
89
+ "gpt_neox.layers.13.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
90
+ "gpt_neox.layers.13.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
91
+ "gpt_neox.layers.13.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
92
+ "gpt_neox.layers.13.input_layernorm.bias": "model-00002-of-00003.safetensors",
93
+ "gpt_neox.layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
94
+ "gpt_neox.layers.13.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
95
+ "gpt_neox.layers.13.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
96
+ "gpt_neox.layers.13.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
97
+ "gpt_neox.layers.13.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
98
+ "gpt_neox.layers.13.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
99
+ "gpt_neox.layers.13.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
100
+ "gpt_neox.layers.14.attention.bias": "model-00002-of-00003.safetensors",
101
+ "gpt_neox.layers.14.attention.dense.bias": "model-00002-of-00003.safetensors",
102
+ "gpt_neox.layers.14.attention.dense.weight": "model-00002-of-00003.safetensors",
103
+ "gpt_neox.layers.14.attention.masked_bias": "model-00002-of-00003.safetensors",
104
+ "gpt_neox.layers.14.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
105
+ "gpt_neox.layers.14.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
106
+ "gpt_neox.layers.14.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
107
+ "gpt_neox.layers.14.input_layernorm.bias": "model-00002-of-00003.safetensors",
108
+ "gpt_neox.layers.14.input_layernorm.weight": "model-00002-of-00003.safetensors",
109
+ "gpt_neox.layers.14.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
110
+ "gpt_neox.layers.14.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
111
+ "gpt_neox.layers.14.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
112
+ "gpt_neox.layers.14.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
113
+ "gpt_neox.layers.14.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
114
+ "gpt_neox.layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
115
+ "gpt_neox.layers.15.attention.bias": "model-00002-of-00003.safetensors",
116
+ "gpt_neox.layers.15.attention.dense.bias": "model-00002-of-00003.safetensors",
117
+ "gpt_neox.layers.15.attention.dense.weight": "model-00002-of-00003.safetensors",
118
+ "gpt_neox.layers.15.attention.masked_bias": "model-00002-of-00003.safetensors",
119
+ "gpt_neox.layers.15.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
120
+ "gpt_neox.layers.15.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
121
+ "gpt_neox.layers.15.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
122
+ "gpt_neox.layers.15.input_layernorm.bias": "model-00002-of-00003.safetensors",
123
+ "gpt_neox.layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
124
+ "gpt_neox.layers.15.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
125
+ "gpt_neox.layers.15.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
126
+ "gpt_neox.layers.15.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
127
+ "gpt_neox.layers.15.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
128
+ "gpt_neox.layers.15.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
129
+ "gpt_neox.layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
130
+ "gpt_neox.layers.16.attention.bias": "model-00002-of-00003.safetensors",
131
+ "gpt_neox.layers.16.attention.dense.bias": "model-00002-of-00003.safetensors",
132
+ "gpt_neox.layers.16.attention.dense.weight": "model-00002-of-00003.safetensors",
133
+ "gpt_neox.layers.16.attention.masked_bias": "model-00002-of-00003.safetensors",
134
+ "gpt_neox.layers.16.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
135
+ "gpt_neox.layers.16.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
136
+ "gpt_neox.layers.16.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
137
+ "gpt_neox.layers.16.input_layernorm.bias": "model-00002-of-00003.safetensors",
138
+ "gpt_neox.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
139
+ "gpt_neox.layers.16.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
140
+ "gpt_neox.layers.16.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
141
+ "gpt_neox.layers.16.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
142
+ "gpt_neox.layers.16.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
143
+ "gpt_neox.layers.16.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
144
+ "gpt_neox.layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
145
+ "gpt_neox.layers.17.attention.bias": "model-00002-of-00003.safetensors",
146
+ "gpt_neox.layers.17.attention.dense.bias": "model-00002-of-00003.safetensors",
147
+ "gpt_neox.layers.17.attention.dense.weight": "model-00002-of-00003.safetensors",
148
+ "gpt_neox.layers.17.attention.masked_bias": "model-00002-of-00003.safetensors",
149
+ "gpt_neox.layers.17.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
150
+ "gpt_neox.layers.17.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
151
+ "gpt_neox.layers.17.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
152
+ "gpt_neox.layers.17.input_layernorm.bias": "model-00002-of-00003.safetensors",
153
+ "gpt_neox.layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
154
+ "gpt_neox.layers.17.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
155
+ "gpt_neox.layers.17.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
156
+ "gpt_neox.layers.17.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
157
+ "gpt_neox.layers.17.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
158
+ "gpt_neox.layers.17.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
159
+ "gpt_neox.layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
160
+ "gpt_neox.layers.18.attention.bias": "model-00002-of-00003.safetensors",
161
+ "gpt_neox.layers.18.attention.dense.bias": "model-00003-of-00003.safetensors",
162
+ "gpt_neox.layers.18.attention.dense.weight": "model-00003-of-00003.safetensors",
163
+ "gpt_neox.layers.18.attention.masked_bias": "model-00002-of-00003.safetensors",
164
+ "gpt_neox.layers.18.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
165
+ "gpt_neox.layers.18.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
166
+ "gpt_neox.layers.18.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
167
+ "gpt_neox.layers.18.input_layernorm.bias": "model-00002-of-00003.safetensors",
168
+ "gpt_neox.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
169
+ "gpt_neox.layers.18.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
170
+ "gpt_neox.layers.18.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
171
+ "gpt_neox.layers.18.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
172
+ "gpt_neox.layers.18.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
173
+ "gpt_neox.layers.18.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
174
+ "gpt_neox.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
175
+ "gpt_neox.layers.19.attention.bias": "model-00003-of-00003.safetensors",
176
+ "gpt_neox.layers.19.attention.dense.bias": "model-00003-of-00003.safetensors",
177
+ "gpt_neox.layers.19.attention.dense.weight": "model-00003-of-00003.safetensors",
178
+ "gpt_neox.layers.19.attention.masked_bias": "model-00003-of-00003.safetensors",
179
+ "gpt_neox.layers.19.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
180
+ "gpt_neox.layers.19.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
181
+ "gpt_neox.layers.19.attention.rotary_emb.inv_freq": "model-00003-of-00003.safetensors",
182
+ "gpt_neox.layers.19.input_layernorm.bias": "model-00003-of-00003.safetensors",
183
+ "gpt_neox.layers.19.input_layernorm.weight": "model-00003-of-00003.safetensors",
184
+ "gpt_neox.layers.19.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
185
+ "gpt_neox.layers.19.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
186
+ "gpt_neox.layers.19.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
187
+ "gpt_neox.layers.19.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
188
+ "gpt_neox.layers.19.post_attention_layernorm.bias": "model-00003-of-00003.safetensors",
189
+ "gpt_neox.layers.19.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
190
+ "gpt_neox.layers.2.attention.bias": "model-00001-of-00003.safetensors",
191
+ "gpt_neox.layers.2.attention.dense.bias": "model-00001-of-00003.safetensors",
192
+ "gpt_neox.layers.2.attention.dense.weight": "model-00001-of-00003.safetensors",
193
+ "gpt_neox.layers.2.attention.masked_bias": "model-00001-of-00003.safetensors",
194
+ "gpt_neox.layers.2.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
195
+ "gpt_neox.layers.2.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
196
+ "gpt_neox.layers.2.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
197
+ "gpt_neox.layers.2.input_layernorm.bias": "model-00001-of-00003.safetensors",
198
+ "gpt_neox.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
199
+ "gpt_neox.layers.2.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
200
+ "gpt_neox.layers.2.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
201
+ "gpt_neox.layers.2.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
202
+ "gpt_neox.layers.2.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
203
+ "gpt_neox.layers.2.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
204
+ "gpt_neox.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
205
+ "gpt_neox.layers.20.attention.bias": "model-00003-of-00003.safetensors",
206
+ "gpt_neox.layers.20.attention.dense.bias": "model-00003-of-00003.safetensors",
207
+ "gpt_neox.layers.20.attention.dense.weight": "model-00003-of-00003.safetensors",
208
+ "gpt_neox.layers.20.attention.masked_bias": "model-00003-of-00003.safetensors",
209
+ "gpt_neox.layers.20.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
210
+ "gpt_neox.layers.20.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
211
+ "gpt_neox.layers.20.attention.rotary_emb.inv_freq": "model-00003-of-00003.safetensors",
212
+ "gpt_neox.layers.20.input_layernorm.bias": "model-00003-of-00003.safetensors",
213
+ "gpt_neox.layers.20.input_layernorm.weight": "model-00003-of-00003.safetensors",
214
+ "gpt_neox.layers.20.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
215
+ "gpt_neox.layers.20.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
216
+ "gpt_neox.layers.20.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
217
+ "gpt_neox.layers.20.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
218
+ "gpt_neox.layers.20.post_attention_layernorm.bias": "model-00003-of-00003.safetensors",
219
+ "gpt_neox.layers.20.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
220
+ "gpt_neox.layers.21.attention.bias": "model-00003-of-00003.safetensors",
221
+ "gpt_neox.layers.21.attention.dense.bias": "model-00003-of-00003.safetensors",
222
+ "gpt_neox.layers.21.attention.dense.weight": "model-00003-of-00003.safetensors",
223
+ "gpt_neox.layers.21.attention.masked_bias": "model-00003-of-00003.safetensors",
224
+ "gpt_neox.layers.21.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
225
+ "gpt_neox.layers.21.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
226
+ "gpt_neox.layers.21.attention.rotary_emb.inv_freq": "model-00003-of-00003.safetensors",
227
+ "gpt_neox.layers.21.input_layernorm.bias": "model-00003-of-00003.safetensors",
228
+ "gpt_neox.layers.21.input_layernorm.weight": "model-00003-of-00003.safetensors",
229
+ "gpt_neox.layers.21.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
230
+ "gpt_neox.layers.21.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
231
+ "gpt_neox.layers.21.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
232
+ "gpt_neox.layers.21.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
233
+ "gpt_neox.layers.21.post_attention_layernorm.bias": "model-00003-of-00003.safetensors",
234
+ "gpt_neox.layers.21.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
235
+ "gpt_neox.layers.22.attention.bias": "model-00003-of-00003.safetensors",
236
+ "gpt_neox.layers.22.attention.dense.bias": "model-00003-of-00003.safetensors",
237
+ "gpt_neox.layers.22.attention.dense.weight": "model-00003-of-00003.safetensors",
238
+ "gpt_neox.layers.22.attention.masked_bias": "model-00003-of-00003.safetensors",
239
+ "gpt_neox.layers.22.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
240
+ "gpt_neox.layers.22.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
241
+ "gpt_neox.layers.22.attention.rotary_emb.inv_freq": "model-00003-of-00003.safetensors",
242
+ "gpt_neox.layers.22.input_layernorm.bias": "model-00003-of-00003.safetensors",
243
+ "gpt_neox.layers.22.input_layernorm.weight": "model-00003-of-00003.safetensors",
244
+ "gpt_neox.layers.22.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
245
+ "gpt_neox.layers.22.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
246
+ "gpt_neox.layers.22.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
247
+ "gpt_neox.layers.22.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
248
+ "gpt_neox.layers.22.post_attention_layernorm.bias": "model-00003-of-00003.safetensors",
249
+ "gpt_neox.layers.22.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
250
+ "gpt_neox.layers.23.attention.bias": "model-00003-of-00003.safetensors",
251
+ "gpt_neox.layers.23.attention.dense.bias": "model-00003-of-00003.safetensors",
252
+ "gpt_neox.layers.23.attention.dense.weight": "model-00003-of-00003.safetensors",
253
+ "gpt_neox.layers.23.attention.masked_bias": "model-00003-of-00003.safetensors",
254
+ "gpt_neox.layers.23.attention.query_key_value.bias": "model-00003-of-00003.safetensors",
255
+ "gpt_neox.layers.23.attention.query_key_value.weight": "model-00003-of-00003.safetensors",
256
+ "gpt_neox.layers.23.attention.rotary_emb.inv_freq": "model-00003-of-00003.safetensors",
257
+ "gpt_neox.layers.23.input_layernorm.bias": "model-00003-of-00003.safetensors",
258
+ "gpt_neox.layers.23.input_layernorm.weight": "model-00003-of-00003.safetensors",
259
+ "gpt_neox.layers.23.mlp.dense_4h_to_h.bias": "model-00003-of-00003.safetensors",
260
+ "gpt_neox.layers.23.mlp.dense_4h_to_h.weight": "model-00003-of-00003.safetensors",
261
+ "gpt_neox.layers.23.mlp.dense_h_to_4h.bias": "model-00003-of-00003.safetensors",
262
+ "gpt_neox.layers.23.mlp.dense_h_to_4h.weight": "model-00003-of-00003.safetensors",
263
+ "gpt_neox.layers.23.post_attention_layernorm.bias": "model-00003-of-00003.safetensors",
264
+ "gpt_neox.layers.23.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
265
+ "gpt_neox.layers.3.attention.bias": "model-00001-of-00003.safetensors",
266
+ "gpt_neox.layers.3.attention.dense.bias": "model-00001-of-00003.safetensors",
267
+ "gpt_neox.layers.3.attention.dense.weight": "model-00001-of-00003.safetensors",
268
+ "gpt_neox.layers.3.attention.masked_bias": "model-00001-of-00003.safetensors",
269
+ "gpt_neox.layers.3.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
270
+ "gpt_neox.layers.3.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
271
+ "gpt_neox.layers.3.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
272
+ "gpt_neox.layers.3.input_layernorm.bias": "model-00001-of-00003.safetensors",
273
+ "gpt_neox.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
274
+ "gpt_neox.layers.3.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
275
+ "gpt_neox.layers.3.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
276
+ "gpt_neox.layers.3.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
277
+ "gpt_neox.layers.3.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
278
+ "gpt_neox.layers.3.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
279
+ "gpt_neox.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
280
+ "gpt_neox.layers.4.attention.bias": "model-00001-of-00003.safetensors",
281
+ "gpt_neox.layers.4.attention.dense.bias": "model-00001-of-00003.safetensors",
282
+ "gpt_neox.layers.4.attention.dense.weight": "model-00001-of-00003.safetensors",
283
+ "gpt_neox.layers.4.attention.masked_bias": "model-00001-of-00003.safetensors",
284
+ "gpt_neox.layers.4.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
285
+ "gpt_neox.layers.4.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
286
+ "gpt_neox.layers.4.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
287
+ "gpt_neox.layers.4.input_layernorm.bias": "model-00001-of-00003.safetensors",
288
+ "gpt_neox.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
289
+ "gpt_neox.layers.4.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
290
+ "gpt_neox.layers.4.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
291
+ "gpt_neox.layers.4.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
292
+ "gpt_neox.layers.4.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
293
+ "gpt_neox.layers.4.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
294
+ "gpt_neox.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
295
+ "gpt_neox.layers.5.attention.bias": "model-00001-of-00003.safetensors",
296
+ "gpt_neox.layers.5.attention.dense.bias": "model-00001-of-00003.safetensors",
297
+ "gpt_neox.layers.5.attention.dense.weight": "model-00001-of-00003.safetensors",
298
+ "gpt_neox.layers.5.attention.masked_bias": "model-00001-of-00003.safetensors",
299
+ "gpt_neox.layers.5.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
300
+ "gpt_neox.layers.5.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
301
+ "gpt_neox.layers.5.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
302
+ "gpt_neox.layers.5.input_layernorm.bias": "model-00001-of-00003.safetensors",
303
+ "gpt_neox.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
304
+ "gpt_neox.layers.5.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
305
+ "gpt_neox.layers.5.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
306
+ "gpt_neox.layers.5.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
307
+ "gpt_neox.layers.5.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
308
+ "gpt_neox.layers.5.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
309
+ "gpt_neox.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
310
+ "gpt_neox.layers.6.attention.bias": "model-00001-of-00003.safetensors",
311
+ "gpt_neox.layers.6.attention.dense.bias": "model-00001-of-00003.safetensors",
312
+ "gpt_neox.layers.6.attention.dense.weight": "model-00001-of-00003.safetensors",
313
+ "gpt_neox.layers.6.attention.masked_bias": "model-00001-of-00003.safetensors",
314
+ "gpt_neox.layers.6.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
315
+ "gpt_neox.layers.6.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
316
+ "gpt_neox.layers.6.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
317
+ "gpt_neox.layers.6.input_layernorm.bias": "model-00001-of-00003.safetensors",
318
+ "gpt_neox.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
319
+ "gpt_neox.layers.6.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
320
+ "gpt_neox.layers.6.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
321
+ "gpt_neox.layers.6.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
322
+ "gpt_neox.layers.6.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
323
+ "gpt_neox.layers.6.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
324
+ "gpt_neox.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
325
+ "gpt_neox.layers.7.attention.bias": "model-00001-of-00003.safetensors",
326
+ "gpt_neox.layers.7.attention.dense.bias": "model-00001-of-00003.safetensors",
327
+ "gpt_neox.layers.7.attention.dense.weight": "model-00001-of-00003.safetensors",
328
+ "gpt_neox.layers.7.attention.masked_bias": "model-00001-of-00003.safetensors",
329
+ "gpt_neox.layers.7.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
330
+ "gpt_neox.layers.7.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
331
+ "gpt_neox.layers.7.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
332
+ "gpt_neox.layers.7.input_layernorm.bias": "model-00001-of-00003.safetensors",
333
+ "gpt_neox.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
334
+ "gpt_neox.layers.7.mlp.dense_4h_to_h.bias": "model-00001-of-00003.safetensors",
335
+ "gpt_neox.layers.7.mlp.dense_4h_to_h.weight": "model-00001-of-00003.safetensors",
336
+ "gpt_neox.layers.7.mlp.dense_h_to_4h.bias": "model-00001-of-00003.safetensors",
337
+ "gpt_neox.layers.7.mlp.dense_h_to_4h.weight": "model-00001-of-00003.safetensors",
338
+ "gpt_neox.layers.7.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
339
+ "gpt_neox.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
340
+ "gpt_neox.layers.8.attention.bias": "model-00001-of-00003.safetensors",
341
+ "gpt_neox.layers.8.attention.dense.bias": "model-00001-of-00003.safetensors",
342
+ "gpt_neox.layers.8.attention.dense.weight": "model-00001-of-00003.safetensors",
343
+ "gpt_neox.layers.8.attention.masked_bias": "model-00001-of-00003.safetensors",
344
+ "gpt_neox.layers.8.attention.query_key_value.bias": "model-00001-of-00003.safetensors",
345
+ "gpt_neox.layers.8.attention.query_key_value.weight": "model-00001-of-00003.safetensors",
346
+ "gpt_neox.layers.8.attention.rotary_emb.inv_freq": "model-00001-of-00003.safetensors",
347
+ "gpt_neox.layers.8.input_layernorm.bias": "model-00001-of-00003.safetensors",
348
+ "gpt_neox.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
349
+ "gpt_neox.layers.8.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
350
+ "gpt_neox.layers.8.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
351
+ "gpt_neox.layers.8.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
352
+ "gpt_neox.layers.8.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
353
+ "gpt_neox.layers.8.post_attention_layernorm.bias": "model-00001-of-00003.safetensors",
354
+ "gpt_neox.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
355
+ "gpt_neox.layers.9.attention.bias": "model-00002-of-00003.safetensors",
356
+ "gpt_neox.layers.9.attention.dense.bias": "model-00002-of-00003.safetensors",
357
+ "gpt_neox.layers.9.attention.dense.weight": "model-00002-of-00003.safetensors",
358
+ "gpt_neox.layers.9.attention.masked_bias": "model-00002-of-00003.safetensors",
359
+ "gpt_neox.layers.9.attention.query_key_value.bias": "model-00002-of-00003.safetensors",
360
+ "gpt_neox.layers.9.attention.query_key_value.weight": "model-00002-of-00003.safetensors",
361
+ "gpt_neox.layers.9.attention.rotary_emb.inv_freq": "model-00002-of-00003.safetensors",
362
+ "gpt_neox.layers.9.input_layernorm.bias": "model-00002-of-00003.safetensors",
363
+ "gpt_neox.layers.9.input_layernorm.weight": "model-00002-of-00003.safetensors",
364
+ "gpt_neox.layers.9.mlp.dense_4h_to_h.bias": "model-00002-of-00003.safetensors",
365
+ "gpt_neox.layers.9.mlp.dense_4h_to_h.weight": "model-00002-of-00003.safetensors",
366
+ "gpt_neox.layers.9.mlp.dense_h_to_4h.bias": "model-00002-of-00003.safetensors",
367
+ "gpt_neox.layers.9.mlp.dense_h_to_4h.weight": "model-00002-of-00003.safetensors",
368
+ "gpt_neox.layers.9.post_attention_layernorm.bias": "model-00002-of-00003.safetensors",
369
+ "gpt_neox.layers.9.post_attention_layernorm.weight": "model-00002-of-00003.safetensors"
370
+ }
371
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:33ab71caac3a04a2e62ac20e5ffc9f2c58c89e2409f57e42b86ea53688772c3c
size 1125308122
special_tokens_map.json ADDED
@@ -0,0 +1,11 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|sep|>",
    "<|acc|>",
    "<|tel|>",
    "<|rrn|>"
  ],
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
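
For reference, these special-token entries can be checked after loading the tokenizer. This is a minimal sketch, assuming the `transformers` library is installed and the Hub is reachable:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")

print(tokenizer.eos_token)                  # <|endoftext|>
print(tokenizer.pad_token)                  # <|endoftext|>
print(tokenizer.additional_special_tokens)  # as declared above: <|endoftext|>, <|sep|>, <|acc|>, <|tel|>, <|rrn|>
```
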
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,6 @@
{
  "name_or_path": "EleutherAI/polyglot-ko-1.3b",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "PreTrainedTokenizerFast"
}