RangiLyu/InternLM2-tokenizer-llama

RangiLyu committed on Jan 25

Commit 761e1f3 • 1 Parent(s): bd303f6

Update README.md

Files changed (1)

README.md +30 -26

@@ -17,32 +17,36 @@ It also replaces the 354 token `\u0000` with an emoji so that it can be converte
 ## How to use
 - Load
-  ```python
-  from transformers import AutoTokenizer
-  tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")
-  ```
 - Apply chatml template
-  ```python
-  chat = [{"role": "user", "content": "Hello! What's your name?"},
-          {"role": "assistant", "content": "My name is InternLM2!"}]
-  chat_ids = tokenizer.apply_chat_template(chat)
-  print("ids: ", chat_ids)
-  print("tokens: ", tokenizer.convert_ids_to_tokens(chat_ids))
-  chat_str = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
-  print("chat string: ", chat_str)
-  ```
-  ```
-  ids:  [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364]
-  tokens:  ['<s>', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
-  chat string:  <s><|im_start|>user
-  Hello! What's your name?<|im_end|>
-  <|im_start|>assistant
-  My name is InternLM2!<|im_end|>
-  <|im_start|>assistant
   ```

 ## How to use
 - Load
+```python
+from transformers import AutoTokenizer
+llama_tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")
+```
 - Apply chatml template
+```python
+chat = [{"role": "user", "content": "Hello! What's your name?"},
+        {"role": "assistant", "content": "My name is InternLM2!"},
+        {"role": "user", "content": "Nice to meet you InternLM2!"},]
+chat_ids = llama_tokenizer.apply_chat_template(chat)
+print("ids: ", chat_ids)
+print("tokens: ", llama_tokenizer.convert_ids_to_tokens(chat_ids))
+# convert the chat history to a string for generation
+chat_str = llama_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+print("chat string: ", chat_str)
+```
+```
+ids:  [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364, 92543, 1008, 364, 44501, 442, 3531, 629, 4576, 11146, 314, 346, 92542, 364]
+tokens:  ['<s>', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n', '<|im_start|>', 'user', '\n', 'Nice', '▁to', '▁meet', '▁you', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
+chat string:  <s><|im_start|>user
+Hello! What's your name?<|im_end|>
+<|im_start|>assistant
+My name is InternLM2!<|im_end|>
+<|im_start|>user
+Nice to meet you InternLM2!<|im_end|>
+<|im_start|>assistant
   ```
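
To check the expected output in this change without downloading the tokenizer, the ChatML layout can be reproduced in plain Python. This is a minimal sketch, not the template shipped in the repository: `render_chatml` is a hypothetical helper, and the newline after the final `<|im_start|>assistant` is an assumption, since the printed chat string does not show trailing whitespace.

```python
def render_chatml(messages, add_generation_prompt=False, bos_token="<s>"):
    """Hypothetical helper: format a chat history in the ChatML layout
    visible in the README output (not the tokenizer's actual template)."""
    parts = [bos_token]
    for message in messages:
        # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Assumption: the assistant header is followed by a newline,
        # mirroring how every other role header is rendered.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2!"},
    {"role": "user", "content": "Nice to meet you InternLM2!"},
]

chat_str = render_chatml(chat, add_generation_prompt=True)
print(chat_str)
```

Comparing `chat_str` against the string returned by `apply_chat_template(chat, tokenize=False, add_generation_prompt=True)` is a quick way to confirm that the converted tokenizer applies the template as the README describes.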