Add config, tokenizer, and README
- README.md +120 -0
- config.json +23 -0
- generation_config.json +7 -0
- special_tokens_map.json +5 -0
- tokenizer.json +0 -0
- tokenizer_config.json +8 -0
README.md
CHANGED
@@ -1,3 +1,123 @@
 ---
 license: mit
+language:
+- ko
+- en
+metrics:
+- perplexity
+- accuracy
+pipeline_tag: text-generation
+tags:
+- llama
+- KoLLAMA
+- KoreanGPT
 ---
+
+> 🚧 Note: this repo is under construction 🚧
+
+## Todo
+
+✅ - finished
+
+⏳ - currently working on it
+
+- ✅ Train new BBPE Tokenizer
+- ✅ Test train code on TPUv4 Pods (with model parallel)
+- ✅ Converting test (JAX to PyTorch)
+- ✅ LM train validation on minimal dataset (1 sentence, 1000 steps)
+- ⏳ Build Data Shuffler (curriculum learning)
+- ⏳ Train 7B Model
+- ⏳ Train 13B Model
+- ⏳ Train 33B Model
+- Train 65B Model
+
+# KoLLaMA Model Card
+
+KoLLaMA (33B) is trained on a Korean/English/Code dataset with the LLaMA architecture via JAX,
+with the warm support of the [Google TPU Research Cloud program](https://sites.research.google/trc/about/), which provided part of the computation resources.
+
+## Model details
+
+**Researcher developing the model**
+
+Junbum Lee (a.k.a. Beomi)
+
+**Model date**
+
+KoLLaMA training started in 2023.04~
+
+- 33B model training started in 2023.07~
+
+**Model version**
+
+This is an alpha version of the model.
+
+**Model type**
+
+LLaMA is an auto-regressive language model based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.
+
+(This repo contains the 33B model!)
+
+**Paper or resources for more information**
+
+More information can be found in the paper "LLaMA: Open and Efficient Foundation Language Models", available at https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/.
+
+**Citation details**
+
+KoLLaMA: [TBD]
+LLaMA: https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/
+
+**License**
+
+MIT
+
+**Where to send questions or comments about the model**
+
+Questions and comments about KoLLaMA can be sent via the [GitHub repository](https://github.com/beomi/KoLLAMA) of the project, by opening an issue.
+
+## Intended use
+
+**Primary intended uses**
+
+The primary use of KoLLaMA is research on Korean open-source large language models.
+
+**Primary intended users**
+
+The primary intended users of the model are researchers in natural language processing, machine learning and artificial intelligence.
+
+**Out-of-scope use cases**
+
+LLaMA is a base, or foundational, model. As such, it should not be used in downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information, or generally unhelpful answers.
+
+## Factors
+
+**Relevant factors**
+
+One of the most relevant factors for which model performance may vary is the language used. Although we included 20 languages in the training data, most of our dataset is made of English text, and we thus expect the model to perform better on English than on other languages. Relatedly, previous studies have shown that performance might vary across different dialects, and we expect that this will also be the case for our model.
+
+## Evaluation datasets
+
+[TBD]
+
+## Training dataset
+
+[TBD]
+
+## Ethical considerations
+
+**Data**
+
+The data used to train the model was collected from various sources, mostly from the Web. As such, it contains offensive, harmful and biased content. We thus expect the model to exhibit such biases from the training data.
+
+**Human life**
+
+The model is not intended to inform decisions about matters central to human life, and should not be used in such a way.
+
+**Risks and harms**
+
+Risks and harms of large language models include the generation of harmful, offensive or biased content. These models are often prone to generating incorrect information, sometimes referred to as hallucinations. We do not expect our model to be an exception in this regard.
+
+**Use cases**
+
+LLaMA is a foundational model, and as such, it should not be used for downstream applications without further investigation and mitigation of risks. These risks and potentially fraught use cases include, but are not limited to: generation of misinformation and generation of harmful, biased or offensive content.
config.json
ADDED
@@ -0,0 +1,23 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "bos_token_id": 1,
+  "eos_token_id": 0,
+  "hidden_act": "silu",
+  "hidden_size": 6656,
+  "initializer_range": 0.02,
+  "intermediate_size": 17920,
+  "max_position_embeddings": 2048,
+  "max_sequence_length": 2048,
+  "model_type": "llama",
+  "num_attention_heads": 52,
+  "num_hidden_layers": 60,
+  "pad_token_id": 0,
+  "rms_norm_eps": 1e-06,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.28.0.dev0",
+  "use_cache": true,
+  "vocab_size": 52000
+}
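The sizes in this config match the LLaMA 33B layout (hidden size 6656, 60 layers, 52 heads). As a rough sanity check, a back-of-the-envelope parameter count can be derived from these values using the standard LLaMA layer structure (SwiGLU MLP, untied input/output embeddings per `tie_word_embeddings: false`); this is an estimate sketched from the config alone, not a count of the actual checkpoint:

```python
# Rough parameter count implied by config.json, using the standard
# LLaMA layer layout (q/k/v/o attention projections, SwiGLU MLP with
# gate/up/down projections, two RMSNorms per layer, untied embeddings).
hidden = 6656    # hidden_size
layers = 60      # num_hidden_layers
inter = 17920    # intermediate_size
vocab = 52000    # vocab_size

embed = vocab * hidden        # input embedding matrix
lm_head = vocab * hidden      # output projection (tie_word_embeddings: false)
attn = 4 * hidden * hidden    # q, k, v, o projections per layer
mlp = 3 * hidden * inter      # gate, up, down projections per layer
norms = 2 * hidden            # two RMSNorm weight vectors per layer

total = embed + lm_head + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"~{total / 1e9:.1f}B parameters")  # ~32.8B
```

The enlarged 52,000-token vocabulary (vs. 32,000 in the original LLaMA) adds roughly 0.27B parameters over the stock 33B model.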
generation_config.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 0,
+  "pad_token_id": 0,
+  "transformers_version": "4.28.0.dev0"
+}
special_tokens_map.json
ADDED
@@ -0,0 +1,5 @@
+{
+  "bos_token": "<|sep|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>"
+}
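The token strings here pair with the numeric ids declared in config.json and generation_config.json. A quick consistency sketch, with both files' contents inlined as plain dicts for illustration:

```python
# Cross-check between special_tokens_map.json and config.json:
# tokens that share a string must also share an id.
special_tokens_map = {
    "bos_token": "<|sep|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "<|endoftext|>",
}
config_ids = {
    "bos_token_id": 1,
    "eos_token_id": 0,
    "pad_token_id": 0,
}

# eos and pad use the same string "<|endoftext|>" ...
assert special_tokens_map["eos_token"] == special_tokens_map["pad_token"]
# ... so config.json gives them the same id (0).
assert config_ids["eos_token_id"] == config_ids["pad_token_id"]
# bos is a distinct token with its own id (1).
assert config_ids["bos_token_id"] != config_ids["eos_token_id"]
```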
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,8 @@
+{
+  "name_or_path": "beomi/KoLLAMA",
+  "eos_token": "<|endoftext|>",
+  "bos_token": "<|sep|>",
+  "pad_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "PreTrainedTokenizerFast"
+}
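The odd-looking `model_max_length` is not a real limit: it is the `transformers` "effectively unlimited" sentinel, produced by converting the float `1e30` to an integer (picking up binary floating-point rounding along the way). The practical context limit comes from `max_position_embeddings` (2048) in config.json instead:

```python
# model_max_length in tokenizer_config.json is the transformers sentinel
# for "no tokenizer-side limit": int(1e30), with float rounding baked in.
sentinel = int(1e30)
print(sentinel)  # 1000000000000000019884624838656

# The effective sequence-length limit is set by the model config instead.
max_position_embeddings = 2048
```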