HF-compatible tokenizer

#2
by Xenova HF staff - opened

Hey there! You might be interested in my transformers-compatible version of the grok-1 tokenizer, https://huggingface.co/Xenova/grok-1-tokenizer, which can be used as follows:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained('Xenova/grok-1-tokenizer')
inputs = tokenizer('hello world')

This should simplify the example code quite a bit.

Also, feel free to copy the tokenizer.json and tokenizer_config.json to your repo, and then you can use your model ID.
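For anyone who wants to try the copy step, here is a minimal sketch of one way to get the two files into a local folder before uploading them to your own repo. It assumes `save_pretrained` writes both `tokenizer.json` and `tokenizer_config.json` for a fast tokenizer; the helper names `export_tokenizer` and `missing_files` and the target directory are made up for illustration, not part of any repo mentioned here.

```python
from pathlib import Path

# The two files mentioned above.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def missing_files(target_dir):
    """Return the names of expected tokenizer files not present in target_dir."""
    root = Path(target_dir)
    return [name for name in TOKENIZER_FILES if not (root / name).is_file()]


def export_tokenizer(source_id, target_dir):
    """Download the tokenizer from the Hub and save its files locally.

    Hypothetical helper: needs network access and the transformers package,
    so the import is kept inside the function.
    """
    from transformers import LlamaTokenizerFast

    tokenizer = LlamaTokenizerFast.from_pretrained(source_id)
    tokenizer.save_pretrained(target_dir)
    # An empty list means both files were written.
    return missing_files(target_dir)
```

Calling `export_tokenizer('Xenova/grok-1-tokenizer', './grok-1-tokenizer')` should leave the files in that folder, ready to be added to your own model repo. The folder name is only an example.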

HPC-AI Technology org

Hey @Xenova ! Thank you for your work on the transformers-compatible version of the tokenizer! We now link to your Hugging Face Hub repo and use your tokenizer in our example: https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/grok-1/inference_tp.py

This is lovely!

Hello, can this be set in tokenizer_config.json?

@philschmid maybe there could be a ChatML-compatible tokenizer for Grok? 😁 🙏

HPC-AI Technology org
•
edited Mar 27, 2024

Hey @Xenova , would you like to take a look at this PR, which includes tokenizer-related file changes, opened by @ArthurZ ? Not sure if it's the same as your grok-1 tokenizer or if you two worked on it together, so just wanna check with you.

Hi @Jonathan1909 ! Yes, they are identical tokenizers, and have been tested on the entire xnli dataset (all languages). The HF version matches the original sentencepiece version exactly.
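For readers who want to run a similar check themselves, below is a rough sketch of how two tokenizers could be compared over a set of texts. It is not the script used for the xnli test mentioned above (that one is not shown in this thread); `find_mismatches` and the toy encoders are hypothetical names used only for illustration.

```python
def find_mismatches(encode_a, encode_b, texts):
    """Return (text, ids_a, ids_b) for every text the two encoders disagree on.

    encode_a and encode_b are any callables mapping a string to a sequence
    of token ids.
    """
    mismatches = []
    for text in texts:
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


# Toy stand-ins for real tokenizers, so the sketch runs offline.
vocab = {"hello": 0, "world": 1}


def toy_a(text):
    # Case-sensitive lookup; unknown words map to -1.
    return [vocab.get(word, -1) for word in text.split()]


def toy_b(text):
    # Same lookup, but lowercases first, so it differs on capitalised input.
    return [vocab.get(word, -1) for word in text.lower().split()]


texts = ["hello world", "Hello world"]
mismatches = find_mismatches(toy_a, toy_b, texts)
for text, ids_a, ids_b in mismatches:
    print(repr(text), ids_a, ids_b)
```

With real tokenizers, `encode_a` could wrap the HF tokenizer's `encode` method and `encode_b` the sentencepiece processor's `encode`. Whether special tokens need to be disabled on the HF side for a fair comparison is an assumption you would have to check against the tokenizer config.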

HPC-AI Technology org

Thank you @Xenova ! I've merged the PR and tested it. It works well!

Please close this

Xenova changed discussion status to closed
