---
extra_gated_heading: Access beomi/Yi-Ko-6B on Hugging Face
extra_gated_button_content: Submit
extra_gated_fields:
  I agree to share my name, email address and username: checkbox
  I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- pytorch
- Yi-Ko
- 01-ai
- Yi
library_name: transformers
---

> 🚧 Note: this repo is under construction; the first version will be released within ~1 week 🚧

> Update @ 2023.11.30 Test version of Yi-Ko(KoEN)-6B model

# **beomi/Yi-Ko-6B**

The Yi-Ko series models are advanced iterations of the 01-ai/Yi models, 
with an expanded vocabulary and further pretraining on a Korean/English corpus. 
Like their predecessors, the Yi-Ko series models are generative text models ranging from 6 billion to 34 billion parameters.
This repository hosts the **6B** pretrained version,
converted to the Hugging Face Transformers format. 
For access to the other models, consult the index provided below.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** The Yi-Ko series will come in two parameter sizes: 6B and 34B.

**Input** Models take text input only.

**Output** Models generate text only.

**Model Architecture** 

Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.

<small>*The Yi model architecture is based on Llama-2, so the model can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>
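
As a rough illustration of the note above, the following is a minimal sketch of loading the checkpoint with `LlamaForCausalLM`. The helper name `load_yi_ko` and the choice of `torch_dtype="auto"` are assumptions made for this example, not something this card prescribes. Because the repository is gated, it presumably also requires that access has been granted and that you are logged in to the Hugging Face Hub.

```python
MODEL_ID = "beomi/Yi-Ko-6B"  # repository name taken from this model card


def load_yi_ko(model_id: str = MODEL_ID):
    """Return (tokenizer, model); downloads the weights on first call."""
    # Imported lazily so that defining this hypothetical helper stays cheap.
    from transformers import AutoTokenizer, LlamaForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: "auto" keeps whatever dtype is stored in the checkpoint.
    model = LlamaForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model
```

Calling `load_yi_ko()` returns a `(tokenizer, model)` pair that can be used with `model.generate` like any other causal language model in Transformers.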

|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Batch Size (per step)|
|---|---|---|---|---|---|---|---|
|Yi-Ko-6B|*A mix of Korean + English online data*|6B|4k|O|TBD (under training)|5e-5|2048|

**Vocab Expansion**

| Model Name | Vocabulary Size | Description | 
| --- | --- | --- |
| Original Yi-Series | 64000 | Sentencepiece BPE |
| **Expanded Yi-Ko Series** | 78464 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | |
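
The `<0x..>` entries in the first row of the table above are UTF-8 byte-fallback tokens: a Hangul character missing from the vocabulary is emitted as its three UTF-8 bytes. The sketch below reproduces that row with a deliberately simplified, character-level model. The function `byte_fallback_tokens` and the `KNOWN` set are hypothetical and were inferred from the table row, not read from the real Yi vocabulary, and actual BPE merging is not modelled.

```python
import unicodedata


def byte_fallback_tokens(text, known_pieces):
    """Character-level sketch: known pieces pass through, others become bytes."""
    # Normalise to precomposed Hangul so each syllable is a single code point.
    text = unicodedata.normalize("NFC", text)
    tokens = []
    # SentencePiece-style: spaces are represented by the meta symbol U+2581.
    for ch in text.replace(" ", "\u2581"):
        if ch in known_pieces:
            tokens.append(ch)
        else:
            # Fall back to one token per UTF-8 byte, e.g. "<0xEC>".
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Assumption: the only pieces treated as in-vocabulary are those that appear
# un-split in the "Original Yi-Series" row above.
KNOWN = {"하", "은", "가", ",", ".", "\u2581"}

tokens = byte_fallback_tokens("안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ", KNOWN)
print(len(tokens))  # 47
```

Thirteen characters fall back to three bytes each (39 tokens) and eight pieces pass through unchanged, which gives the 47 tokens listed for the original tokenizer, against 10 for the expanded one.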

**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | <small>*Since the **Expanded Yi-Ko Series** prepends `▁` to the beginning of the text (to ensure the same tokenization for Korean sentences), English tokenization shows a negligible difference in the first token.</small>|

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

TBD

## LICENSE

TBD

## Citation

TBD

## Acknowledgement

Training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.