Update README.md
Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture.
- Original Llama-2: 32000 Sentencepiece BPE
- **Expanded Llama-2-ko: 46336** Sentencepiece BPE
  - New vocab and merges, trained with Korean Corpus
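To see why the extra vocab entries and merges matter, here is a toy longest-match segmenter (an illustration only: real SentencePiece BPE applies learned merge rules rather than longest-match, and the exact expansion procedure is not described here):

```python
# Llama-2-ko adds 46336 - 32000 = 14336 Korean pieces to the vocabulary.
# Toy longest-match segmenter showing the effect of an added merged piece
# (illustration only; real SentencePiece BPE uses learned merge rules).
def greedy_tokenize(text: str, vocab: set) -> list:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])             # single-character fallback
            i += 1
    return out

base = {"하", "세", "요"}        # character-level pieces only
expanded = base | {"하세요"}     # one merged Korean piece added
print(greedy_tokenize("하세요", base))      # ['하', '세', '요']
print(greedy_tokenize("하세요", expanded))  # ['하세요']
```

With the merged piece available, the same string costs one token instead of three, which is exactly the effect the examples below show at full scale.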
- Tokenizer Examples: Llama-2 vs **Llama-2-Ko**
  - Uses the same tokenization for English, but a shorter, merged tokenization for Korean.
  - Tokenize "안녕하세요, 오늘은 날씨가 참 좋네요."
    - Llama-2:
      ```
      ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
      ```
    - **Llama-2-Ko**:
      ```
      ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
      ```
  - Tokenize "Llama 2: Open Foundation and Fine-Tuned Chat Models"
    - Llama-2:
      ```
      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
      ```
    - **Llama-2-Ko**:
      ```
      ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
      ```
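The gap in the Korean example can be quantified by simply counting the pieces in each list. A small self-contained check (the token lists are copied verbatim from above; the Hub IDs in the comment are assumptions, and the original Llama-2 checkpoint is gated):

```python
# Token lists copied from the Korean example above. They could be
# regenerated with Hugging Face Transformers, e.g. (assumed Hub IDs):
#   from transformers import AutoTokenizer
#   AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b").tokenize(text)
llama2 = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
          '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
          '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가',
          '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
llama2_ko = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']

# Korean costs roughly 4x more tokens under the original tokenizer,
# while English (second example) tokenizes identically in both models.
print(len(llama2), len(llama2_ko))  # 29 7
```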

---