Update README.md
Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture.
- Original Llama-2: 32000 Sentencepiece BPE
- **Expanded Llama-2-ko: 46336** Sentencepiece BPE
  - New vocab and merges, trained with Korean Corpus
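To see why the extra vocab entries and merges matter, here is a toy longest-match segmenter (an illustration only: real SentencePiece BPE applies learned merge rules rather than longest-match, and the exact expansion procedure is not described here):

```python
# Llama-2-ko adds 46336 - 32000 = 14336 Korean pieces to the vocabulary.
# Toy longest-match segmenter showing the effect of an added merged piece
# (illustration only; real SentencePiece BPE uses learned merge rules).
def greedy_tokenize(text: str, vocab: set) -> list:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])             # single-character fallback
            i += 1
    return out

base = {"하", "세", "요"}        # character-level pieces only
expanded = base | {"하세요"}     # one merged Korean piece added
print(greedy_tokenize("하세요", base))      # ['하', '세', '요']
print(greedy_tokenize("하세요", expanded))  # ['하세요']
```

With the merged piece available, the same string costs one token instead of three, which is exactly the effect the examples below show at full scale.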
- Tokenizer Examples: Llama-2 vs **Llama-2-Ko**
  - Uses the same tokenization for English, but a shorter, merged tokenization for Korean.
  - Tokenize "안녕하세요, 오늘은 날씨가 참 좋네요."
    - Llama-2:
      ```
      ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
      ```
    - **Llama-2-Ko**:
      ```
      ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
      ```
  - Tokenize "Llama 2: Open Foundation and Fine-Tuned Chat Models"
    - Llama-2:
      ```
      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
      ```
    - **Llama-2-Ko**:
      ```
      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
      ```
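The gap in the Korean example can be quantified by simply counting the pieces in each list. A small self-contained check (the token lists are copied verbatim from above; the Hub IDs in the comment are assumptions, and the original Llama-2 checkpoint is gated):

```python
# Token lists copied from the Korean example above. They could be
# regenerated with Hugging Face Transformers, e.g. (assumed Hub IDs):
#   from transformers import AutoTokenizer
#   AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b").tokenize(text)
llama2 = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
          '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
          '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가',
          '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
llama2_ko = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']

# Korean costs roughly 4x more tokens under the original tokenizer,
# while English (second example) tokenizes identically in both models.
print(len(llama2), len(llama2_ko))  # 29 7
```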

---