beomi committed
Commit 34cd4c5 • 1 Parent(s): 9f1bbde

Update README.md

Files changed (1):
  1. README.md +17 -23
README.md CHANGED
@@ -41,30 +41,24 @@ Llama-2-Ko is an auto-regressive language model that uses an optimized transform
 
 **Vocab Expansion**
 
-- Original Llama-2: 32000 Sentencepiece BPE
-- **Expanded Llama-2-ko: 46336** Sentencepiece BPE
-  - New vocab and merges, trained with Korean Corpus
-- Tokenizer Examples: Llama-2 vs **Llama-2-Ko**
-  - Use the same tokenization for English, but a shorter/merged tokenization for Korean.
-  - Tokenize "안녕하세요, 오늘은 날씨가 좋네요."
-    - Llama-2:
-      ```
-      ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
-      ```
-    - **Llama-2-Ko**:
-      ```
-      ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
-      ```
-  - Tokenize "Llama 2: Open Foundation and Fine-Tuned Chat Models"
-    - Llama-2:
-      ```
-      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
-      ```
-    - **Llama-2-Ko**:
-      ```
-      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
-      ```
+| Model Name | Vocabulary Size | Description |
+| --- | --- | --- |
+| Original Llama-2 | 32000 | Sentencepiece BPE |
+| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |
 
+**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
+
+| Model | Tokens |
+| --- | --- |
+| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
+| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |
+
+**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**
+
+| Model | Tokens |
+| --- | --- |
+| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
+| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
 
 # **Model Benchmark**
 
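The token lists in this diff show why the vocab expansion matters: the stock Llama-2 tokenizer falls back to raw UTF-8 byte tokens (e.g. `'녕'` → `'<0xEB>', '<0x85>', '<0x95>'`) for Hangul syllables missing from its vocabulary, while the expanded vocabulary keeps them as whole pieces. A minimal sketch of that byte-fallback mechanism, using a made-up toy vocabulary rather than the real SentencePiece models:

```python
def tokenize_char(ch: str, vocab: set) -> list:
    """Return [ch] if the character is in-vocab; otherwise fall back to
    its UTF-8 bytes rendered as SentencePiece-style <0xNN> tokens."""
    if ch in vocab:
        return [ch]
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

# Hypothetical toy vocabulary: '안' is known, '녕' is not.
toy_vocab = {"안", "하", "세", "요"}
tokens = [tok for ch in "안녕" for tok in tokenize_char(ch, toy_vocab)]
print(tokens)  # ['안', '<0xEB>', '<0x85>', '<0x95>']
```

A larger Korean-trained vocabulary with merges, as in Llama-2-Ko, avoids this fallback entirely, which is why the Korean sentence above drops from 29 tokens to 7.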