beomi committed on
Commit 65bf81e
1 Parent(s): 07cf465

Create README.md

Files changed (1)
  1. README.md +88 -0
README.md ADDED
@@ -0,0 +1,88 @@
---
extra_gated_heading: Access beomi/Yi-Ko-6B on Hugging Face
extra_gated_button_content: Submit
extra_gated_fields:
  I agree to share my name, email address and username: checkbox
  I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- pytorch
- Yi-Ko
- 01-ai
- Yi
library_name: transformers
---

# **beomi/Yi-Ko-6B**

Yi-Ko series models serve as advanced iterations of the 01-ai/Yi models,
benefiting from an expanded vocabulary and the inclusion of a Korean/English corpus in their further pretraining.
Just like their predecessors, Yi-Ko series models operate within the broad range of generative text models that stretch from 6 billion to 34 billion parameters.
This repository focuses on the **6B** pretrained version,
which is tailored to fit the Hugging Face Transformers format.
For access to the other models, feel free to consult the index provided below.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** The Yi-Ko series will come in a range of parameter sizes: 6B and 34B variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.

<small>*The Yi model architecture is based on Llama-2, so it can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>

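Because the checkpoint is Llama-compatible, it loads with the standard Transformers causal-LM API. A minimal sketch, not taken from the card itself; the dtype, device placement, and sampling settings are assumptions to adjust for your setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "beomi/Yi-Ko-6B"

# AutoModelForCausalLM resolves to LlamaForCausalLM for this checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: switch to float16/float32 if bf16 is unsupported
    device_map="auto",           # requires the `accelerate` package
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
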
|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Batch Size (per step)|
|---|---|---|---|---|---|---|---|
|Yi-Ko-6B|*A mix of Korean + English online data*|6B|4k|O|TBD (under training)|5e-5|2048|

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Yi-Series | 64000 | SentencePiece BPE |
| **Expanded Yi-Ko Series** | 78464 | SentencePiece BPE. Added Korean vocab and merges |

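The vocabulary sizes above can be checked directly from the tokenizers. A small sketch, assuming `01-ai/Yi-6B` as the reference checkpoint for the original Yi tokenizer:

```python
from transformers import AutoTokenizer

yi = AutoTokenizer.from_pretrained("01-ai/Yi-6B")        # original Yi tokenizer (assumed reference repo)
yi_ko = AutoTokenizer.from_pretrained("beomi/Yi-Ko-6B")  # expanded Yi-Ko tokenizer

print(len(yi))     # expected: 64000
print(len(yi_ko))  # expected: 78464 (Korean vocab and merges added)
```
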
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>||

**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | <small>*Since the **Expanded Yi-Ko Series** tokenizer prepends a whitespace marker (`▁`) at the beginning of the text (to ensure identical tokenization of Korean sentences), the only difference in English tokenization is the first token; the effect is negligible.</small>|

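The token counts in these tables can be reproduced with the Yi-Ko tokenizer itself. A short sketch using the two example sentences from this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/Yi-Ko-6B")

for text in (
    "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ",
    "The Yi series models are large language models trained from scratch by developers at 01.AI.",
):
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)  # expected: 10 tokens for the Korean sentence, 21 for the English one
```
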
# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

TBD

## LICENSE

TBD

## Citation

TBD

## Acknowledgement

The training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.