Commit 9cdd32a by beomi (1 parent: 46198dd)

Create README.md

Files changed (1): README.md (+115, -0)
---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
Like its predecessor, Open-Llama-2-Ko belongs to the family of generative text models ranging from 7 billion to 70 billion parameters.
This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora,
including [AI Hub](https://www.aihub.or.kr), [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
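
Since the checkpoint ships in the Hugging Face Transformers format, it can be loaded with the standard `AutoTokenizer`/`AutoModelForCausalLM` APIs. A minimal generation sketch, assuming the repository id `beomi/open-llama-2-ko-7b` (not stated in this card; adjust to the actual repo):

```python
# Minimal generation sketch; the repository id below is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 on supported NVIDIA GPUs; use float32 on CPU
    device_map="auto",
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```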

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Context Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of Publicly Accessible Korean Corpus*|7B|4k|&#10007;|>15B*|5e-5|

**Train Corpus**

TBD

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."** (Korean for "Hello, the weather is nice today.")

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
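
The vocabulary sizes and token splits above can be checked by loading the two tokenizers side by side. A small sketch, assuming access to the gated `meta-llama/Llama-2-7b-hf` repository and the assumed `beomi/open-llama-2-ko-7b` repository id:

```python
# Compare vocabulary size and tokenization between the original Llama-2 tokenizer
# and the Korean-expanded tokenizer. Both repo ids are assumptions; the Meta repo is gated.
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama2_ko = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)

print(len(llama2), len(llama2_ko))  # expected: 32000 vs 46336

for text in ["안녕하세요, 오늘은 날씨가 좋네요.",
             "Llama 2: Open Foundation and Fine-Tuned Chat Models"]:
    print(text)
    print("  Llama-2   :", llama2.tokenize(text))
    print("  Llama-2-Ko:", llama2_ko.tokenize(text))
```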

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Evaluated with EleutherAI's lm-evaluation-harness (polyglot branch): https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

TBD
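
Until scores are published, the sketch below shows how such an evaluation is typically invoked through the harness's Python API. This is a hedged example, not the exact command used for this model: the repository id, the `"gpt2"` model-type string, and the KOBEST task names are assumptions and may differ on the polyglot branch.

```python
# Hedged sketch of running lm-evaluation-harness via its Python API on Korean tasks.
# Repo id, model-type string, and task names are assumptions; adjust for the polyglot branch.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",  # generic HF causal-LM loader in older harness versions
    model_args="pretrained=beomi/open-llama-2-ko-7b",
    tasks=["kobest_copa", "kobest_hellaswag"],
    num_fewshot=5,
)
print(results["results"])
```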

## Note for oobabooga/text-generation-webui

Change the `except ValueError:` clause in the `load_tokenizer` function (around line 109) of `modules/models.py` to a bare `except:`, as in the diff below.

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the fast tokenizer provided by the HF `tokenizers` library rather than the `sentencepiece` package,
the `use_fast=True` option is required when initializing the tokenizer.
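
For reference, a short sketch of the intended initialization (the repository id is an assumption):

```python
# Load the fast (HF `tokenizers`) tokenizer explicitly; the repo id is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)
print(tokenizer.is_fast)  # expected: True
```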

Apple Silicon does not support BF16 computation; use the CPU instead. (BF16 is supported on NVIDIA GPUs.)
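
A small device/dtype selection sketch along those lines (PyTorch only; falls back to float32 whenever CUDA BF16 is unavailable):

```python
# Choose a compute dtype: BF16 on CUDA devices that support it, float32 otherwise
# (e.g. on CPU, including Apple silicon machines).
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype, device = torch.bfloat16, "cuda"
else:
    dtype, device = torch.float32, "cpu"

print(f"Using dtype={dtype}, device={device}")
```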

## Citation

TBD

## Acknowledgement

- The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus comes from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).