InternLM2 tokenizer (llamaified version)

Official repo: https://github.com/InternLM/InternLM

Note

This repo converts the InternLM2 tokenizer to LlamaTokenizerFast.

It also replaces token 354 (\u0000) with an emoji so that the tokenizer can be converted by llama.cpp.

How to use

Load

from transformers import AutoTokenizer

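# The repo ID must be passed as a string. The name llama_tokenizer matches the examples below.
llama_tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")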

Apply the ChatML template

chat = [{"role": "user", "content": "Hello! What's your name?"},
        {"role": "assistant", "content": "My name is InternLM2!"},
        {"role": "user", "content": "Nice to meet you InternLM2!"},]

chat_ids = llama_tokenizer.apply_chat_template(chat)
print("ids: ", chat_ids)
print("tokens: ", llama_tokenizer.convert_ids_to_tokens(chat_ids))

# convert the chat history to a string for generation
chat_str = llama_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("chat string: ", chat_str)

Output:

ids:  [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364, 92543, 1008, 364, 44501, 442, 3531, 629, 4576, 11146, 314, 346, 92542, 364]
tokens:  ['<s>', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n', '<|im_start|>', 'user', '\n', 'Nice', '▁to', '▁meet', '▁you', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
chat string:  <s><|im_start|>user
Hello! What's your name?<|im_end|>
<|im_start|>assistant
My name is InternLM2!<|im_end|>
<|im_start|>user
Nice to meet you InternLM2!<|im_end|>
<|im_start|>assistant
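
The chat string above follows the ChatML layout: a leading `<s>`, then each message wrapped as `<|im_start|>{role}`, a newline, the content, `<|im_end|>`, and a newline. The following is a minimal pure-Python sketch of that layout, reconstructed from the printed output only. It is not the tokenizer's actual chat template; `render_chatml` is a hypothetical helper, and the trailing newline after the final `<|im_start|>assistant` is an assumption, since the printed output cannot show it.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False,
                  bos_token: str = "<s>") -> str:
    """Hypothetical helper: lay out chat messages in ChatML form."""
    parts = [bos_token]
    for message in messages:
        # One block per message: start marker, role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        # Assumption: the opener ends with a newline.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Hello! What's your name?"},
        {"role": "assistant", "content": "My name is InternLM2!"},
        {"role": "user", "content": "Nice to meet you InternLM2!"}]

chat_str = render_chatml(chat, add_generation_prompt=True)
print(chat_str)
```

This sketch can serve as a quick sanity check on prompt layout without loading the tokenizer. For real use, `apply_chat_template` remains the authoritative source.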