---
library_name: transformers
tags:
- gemma
- chatml
---

# ChatML Tokenizer for Gemma

This repository contains a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format.

The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`. No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.

_Note: This tokenizer is not 100% ChatML compliant, since [google/gemma-7b](https://huggingface.co/google/gemma-7b) always seems to require the original `<bos>` token to be part of the input. The chat template is therefore `<bos>` + ChatML._

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n
```

## Test

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# compare the special tokens of both tokenizers
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check that the vocabularies have the same length
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same length"

# apply both chat templates to the same messages
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")
```
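To make the template's behavior concrete without downloading the tokenizer, the format it produces can be sketched in plain Python. The `render_chatml` helper below is hypothetical (it is not part of this repository or `transformers`); it only illustrates the `<bos>` + ChatML layout described above, including the `add_generation_prompt` behavior that appends an opening assistant turn for inference.

```python
# Hypothetical sketch of the template this tokenizer applies:
# the Gemma <bos> token followed by standard ChatML turns.
def render_chatml(messages, add_generation_prompt=False, bos="<bos>"):
    out = bos
    for message in messages:
        # each turn: <|im_start|>{role}\n{content}<|im_end|>\n
        out += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # open an assistant turn so the model continues from here
        out += "<|im_start|>assistant\n"
    return out


messages = [{"role": "user", "content": "Hello, how are you?"}]
print(render_chatml(messages, add_generation_prompt=True))
# <bos><|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
```

For real use, prefer `tokenizer.apply_chat_template(...)` as shown above; this sketch is only meant to make the expected string format easy to inspect.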