---
license: other
---

# InternLM2 tokenizer (llamaified version)

Official repo: https://github.com/InternLM/InternLM

## Note

This repo converts the InternLM2 tokenizer to `LlamaTokenizerFast`. It also replaces token 354 (`\u0000`) with an emoji so that the tokenizer can be converted by llama.cpp.

## How to use

- Load

```python
from transformers import AutoTokenizer

# The repo id must be passed as a string
llama_tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")
```

- Apply chatml template

```python
chat = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2!"},
    {"role": "user", "content": "Nice to meet you InternLM2!"},
]

# tokenize the chat history with the ChatML template
chat_ids = llama_tokenizer.apply_chat_template(chat)
print("ids: ", chat_ids)
print("tokens: ", llama_tokenizer.convert_ids_to_tokens(chat_ids))

# convert the chat history to a string for generation
chat_str = llama_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("chat string: ", chat_str)
```

```
ids: [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364, 92543, 1008, 364, 44501, 442, 3531, 629, 4576, 11146, 314, 346, 92542, 364]
tokens: ['', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n', '<|im_start|>', 'user', '\n', 'Nice', '▁to', '▁meet', '▁you', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
chat string: <|im_start|>user
Hello! What's your name?<|im_end|>
<|im_start|>assistant
My name is InternLM2!<|im_end|>
<|im_start|>user
Nice to meet you InternLM2!<|im_end|>
<|im_start|>assistant
```
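
To make the ChatML layout in the chat string above explicit, here is a minimal pure-Python sketch that rebuilds the same string without loading the tokenizer. The helper name `format_chatml` is hypothetical and not part of this repo or of `transformers`. The real template is stored in the tokenizer config and may differ in details, such as how the BOS token (id 1 in the ids above) is handled.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages in the ChatML layout seen in the output above (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Assumption: the generation prompt opens an assistant turn for the model to continue
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2!"},
    {"role": "user", "content": "Nice to meet you InternLM2!"},
]

chat_str = format_chatml(chat, add_generation_prompt=True)
print(chat_str)
```

For real use, prefer `apply_chat_template`, since it reads the template shipped with the tokenizer; the sketch is only meant to show where the `<|im_start|>`, `<|im_end|>`, and newline tokens in the ids list come from.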