	---
	library_name: transformers
	tags: ["gemma","chatml"]
	---

	# ChatML Tokenizer for Gemma

	This repository contains a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.

	No new tokens were added in the process, so the vocabulary size is unchanged and the original model's embedding matrix does not need to be resized.
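The replacement described above can be sketched on a `tokenizer.json`-style dictionary. This is a minimal illustration, not the script used to build this repository: the helper name `rename_tokens` is hypothetical, the toy dictionary only mimics the `model.vocab` and `added_tokens` layout of a serialized fast tokenizer, and whether the two turn tokens appear in `added_tokens` for Gemma is an assumption.

```python
import copy


def rename_tokens(tokenizer_json: dict, renames: dict) -> dict:
    """Return a copy of a tokenizer.json-style dict with some token strings replaced.

    `renames` maps token id -> new string. Ids are left untouched, so the
    vocabulary size (and therefore the embedding matrix) stays the same.
    """
    patched = copy.deepcopy(tokenizer_json)
    vocab = patched["model"]["vocab"]

    # Refuse to reuse a string that already belongs to a different id.
    for token_id, new_token in renames.items():
        if vocab.get(new_token, token_id) != token_id:
            raise ValueError(f"{new_token!r} already exists with another id")

    # Rebuild the vocab with the new strings, keeping ids and order.
    patched["model"]["vocab"] = {
        renames.get(token_id, token): token_id for token, token_id in vocab.items()
    }

    # Special tokens may also be listed separately; patch them if present.
    for entry in patched.get("added_tokens", []):
        if entry["id"] in renames:
            entry["content"] = renames[entry["id"]]
    return patched


# Toy stand-in for the real file (assumed layout, three tokens only).
toy = {
    "added_tokens": [
        {"id": 106, "content": "<start_of_turn>"},
        {"id": 107, "content": "<end_of_turn>"},
    ],
    "model": {"vocab": {"<bos>": 2, "<start_of_turn>": 106, "<end_of_turn>": 107}},
}

patched = rename_tokens(toy, {106: "<|im_start|>", 107: "<|im_end|>"})
print(patched["model"]["vocab"])
# {'<bos>': 2, '<|im_start|>': 106, '<|im_end|>': 107}
```

Because only the strings change, every id the model was trained on still points at the same embedding row.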


	_Note: This tokenizer is not 100% ChatML compliant, because [google/gemma-7b](https://huggingface.co/google/gemma-7b) appears to always require the original `<bos>` token to be part of the input. As a result, the chat template is `<bos>` + `chatml` + `<eos>`._

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

	messages = [
	    {"role": "system", "content": "You are Gemma."},
	    {"role": "user", "content": "Hello, how are you?"},
	    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
	]

	chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
	print(chatml)
	# <bos><|im_start|>system
	# You are Gemma.<|im_end|>
	# <|im_start|>user
	# Hello, how are you?<|im_end|>
	# <|im_start|>assistant
	# I'm doing great. How can I help you today?<|im_end|>\n<eos>

	```
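To make the `<bos>` + `chatml` + `<eos>` layout concrete, the printed format can be approximated in plain Python. This is only a sketch inferred from the output shown above. It is not the Jinja chat template shipped with the tokenizer, and the helper name `render_chatml` is hypothetical.

```python
def render_chatml(messages, bos="<bos>", eos="<eos>"):
    """Approximate the printed format: <bos> + ChatML turns + <eos>."""
    # Each turn is "<|im_start|>{role}\n{content}<|im_end|>\n".
    turns = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return f"{bos}{turns}{eos}"


messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

rendered = render_chatml(messages)
print(rendered)
```

For real use, prefer `tokenizer.apply_chat_template`, which applies the template stored with the tokenizer.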


	## Test

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
	original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

	# get special tokens
	print(tokenizer.special_tokens_map)
	print(original_tokenizer.special_tokens_map)

	# check length of vocab
	assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"

	# tokenize messages
	messages = [
	    {"role": "user", "content": "Hello, how are you?"},
	    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
	]

	chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
	google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)

	print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")

	```
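The test above compares vocabulary sizes and rendered prompts. A further check, sketched here as an optional addition, is to confirm that ids `106` and `107` decode to different strings in the two tokenizers. The helper name `check_token_swap` is hypothetical. It relies only on `len(...)` and `convert_ids_to_tokens`, both of which Transformers tokenizers provide.

```python
def check_token_swap(tokenizer, original_tokenizer, swapped_ids=(106, 107)):
    """Return {id: (original_token, new_token)} for the ids expected to differ."""
    # Same vocabulary size means the embedding matrix needs no resizing.
    assert len(tokenizer) == len(original_tokenizer), "vocab sizes differ"

    swapped = {}
    for token_id in swapped_ids:
        old = original_tokenizer.convert_ids_to_tokens(token_id)
        new = tokenizer.convert_ids_to_tokens(token_id)
        # The id stays the same; only the string behind it should change.
        assert old != new, f"id {token_id} was not renamed"
        swapped[token_id] = (old, new)
    return swapped
```

Calling `check_token_swap(tokenizer, original_tokenizer)` with the two tokenizers loaded in the test above should return the `<start_of_turn>` / `<|im_start|>` and `<end_of_turn>` / `<|im_end|>` pairs, provided the ids match the description at the top of this README.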