Conversion to tiktoken

#4
by koyfman - opened

Hi @Xenova - Is there a mechanism to convert in the other direction - HF tokenizer to tiktoken model?

Owner

Not a general way that I know of, but the original gpt4 tokenizer is already in tiktoken format, which you can use. Do you have a specific tokenizer in mind?

> the original gpt4 tokenizer is already in tiktoken format,

Right, I was more thinking about training a HF tokenizer from scratch and creating a tiktoken model from that. Thanks 🙂
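For anyone who wants to try this direction, here is a rough sketch of one possible approach, not an established conversion path. It assumes the HF tokenizer is a byte-level BPE model (GPT-2 style), so each vocab entry is a string of "printable" stand-in characters that can be mapped back to raw bytes. The helper names `vocab_to_mergeable_ranks` and `build_tiktoken_encoding` are made up for illustration. It also assumes the token ids follow merge order, since tiktoken decides merges from ranks alone, and that all 256 single-byte tokens are present in the vocab. The split regex (`pat_str`) has to match the HF pre-tokenizer, so it is left as a parameter rather than guessed here.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each byte (0-255) to a printable unicode character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def vocab_to_mergeable_ranks(vocab, special_tokens=()):
    """Turn a byte-level BPE vocab {token_str: id} into {token_bytes: rank}.

    Hypothetical helper. Special tokens are skipped because tiktoken takes
    them as a separate mapping.
    """
    ranks = {}
    for token, idx in vocab.items():
        if token in special_tokens:
            continue
        ranks[bytes(UNICODE_TO_BYTE[ch] for ch in token)] = idx
    return ranks


def build_tiktoken_encoding(name, vocab, special_tokens, pat_str):
    """Hypothetical wrapper: build a tiktoken.Encoding from an HF-style vocab.

    `special_tokens` is a {token_str: id} dict. `pat_str` is an assumption the
    caller must supply; it should match the HF tokenizer's pre-tokenization.
    """
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.Encoding(
        name=name,
        pat_str=pat_str,
        mergeable_ranks=vocab_to_mergeable_ranks(vocab, special_tokens),
        special_tokens=dict(special_tokens),
    )


if __name__ == "__main__":
    # Toy vocab for illustration only; a real one would come from
    # something like tokenizer.get_vocab() on a trained HF tokenizer.
    toy_vocab = {"a": 0, "Ġhello": 5, "<|endoftext|>": 6}
    print(vocab_to_mergeable_ranks(toy_vocab, special_tokens={"<|endoftext|>"}))
```

Before relying on the result, it would be worth encoding a varied text sample with both tokenizers and comparing the ids, since any mismatch in merge order, normalization, or the split regex would show up there.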

I'd love to see that. The performance gains for semantic chunking with semchunk would be great, since tiktoken is much faster than transformers at tokenization.
