beomi committed on
Commit 84757c3 • 1 Parent(s): 79e3a33

Update README.md

Files changed (1)
  1. README.md +20 -0
README.md CHANGED
@@ -41,6 +41,26 @@ Llama-2-Ko is an auto-regressive language model that uses an optimized transform
  - Original Llama-2: 32000 Sentencepiece BPE
  - **Expanded Llama-2-ko: 46336** Sentencepiece BPE
  - New vocab and merges, trained with Korean Corpus
+ - Tokenizer Examples: Llama-2 vs **Llama-2-Ko**
+ - Use the same tokenization for English, but a shorter/merged tokenization for Korean.
+ - Tokenize "안녕하세요, 오늘은 날씨가 참 좋네요."
+ - Llama-2:
+ ```
+ ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
+ ```
+ - **Llama-2-Ko**:
+ ```
+ ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
+ ```
+ - Tokenize "Llama 2: Open Foundation and Fine-Tuned Chat Models"
+ - Llama-2:
+ ```
+ ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
+ ```
+ - **Llama-2-Ko**:
+ ```
+ ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
+ ```
 
  ---
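
The `<0xNN>` entries in the Llama-2 output above are consistent with SentencePiece byte fallback: a character with no dedicated vocabulary piece is emitted as one token per UTF-8 byte. The Python sketch below reproduces that mapping using only the standard library; it does not load either tokenizer. The helper names `byte_fallback_tokens` and `decode_byte_tokens` are hypothetical, and the two token lists are copied from the Korean example in this commit.

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Spell each UTF-8 byte of `text` as a <0xNN> token (hypothetical helper)."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


def decode_byte_tokens(tokens: list[str]) -> str:
    """Inverse of byte_fallback_tokens: rebuild text from <0xNN> tokens."""
    # token[3:-1] strips the leading "<0x" and the trailing ">".
    return bytes(int(token[3:-1], 16) for token in tokens).decode("utf-8")


# Token lists copied from the README example in this commit.
llama2_tokens = [
    '▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
    '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
    '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가',
    '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요',
]
llama2_ko_tokens = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']

if __name__ == "__main__":
    # A syllable missing from the vocabulary becomes three byte tokens.
    print(byte_fallback_tokens("녕"))
    # Byte tokens can be joined back into the original syllable.
    print(decode_byte_tokens(["<0xEC>", "<0xA2>", "<0x8B>"]))
    # Sequence lengths for the same Korean sentence.
    print(len(llama2_tokens), len(llama2_ko_tokens))
```

Counting the lists as given, the Korean example takes 29 tokens under Llama-2 and 7 under Llama-2-Ko, while the English example is tokenized identically by both.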