It would be useful if someone spent the time to convert the tokenizer to gpt2 HF format

#9
by KnutJaegersberg - opened

I'm also interested in this

It is possible. vonjack has made one https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat.

But the regex rule for pre-tokenization is different, so you will need to train the model to adapt.
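
To illustrate why a different pre-tokenization regex matters, here is a rough sketch. The two patterns are simplified, ASCII-only stand-ins that I made up for illustration (loosely in the spirit of GPT-2-style and cl100k-style splitting); they are not the actual rules used by either tokenizer or by this model. The point is only that the same text gets cut into different chunks before BPE runs, so the resulting token IDs no longer match what the model was trained on.

```python
import re

# Hypothetical, simplified patterns for illustration only (not the real rules).
# GPT-2-like: an optional leading space is attached to a whole run of digits.
GPT2_LIKE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")
# cl100k-like: digits are split into groups of at most three, with no leading space.
CL100K_LIKE = re.compile(r" ?[A-Za-z]+|\d{1,3}| ?[^\sA-Za-z\d]+|\s+")


def pre_tokenize(pattern, text):
    """Split text into the chunks that BPE merges would then be applied to."""
    return pattern.findall(text)


text = "year 20231"
gpt2_chunks = pre_tokenize(GPT2_LIKE, text)      # ['year', ' 20231']
cl100k_chunks = pre_tokenize(CL100K_LIKE, text)  # ['year', ' ', '202', '31']
print(gpt2_chunks)
print(cl100k_chunks)
```

Since BPE merges never cross chunk boundaries, different chunking means different tokens even with an identical vocabulary.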

In fact, the tokenizer is extended from tiktoken's cl100k, which also doesn't have EOS (end of sentence). Also, tiktoken is faster than HF's FastTokenizer.

If the framework you use relies only on EOS to work and doesn't offer options such as stop/stop_token_ids/..., you'd better ditch it. It is not designed for generative language models.
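
For what it's worth, here is a minimal sketch of what stopping on `stop_token_ids` instead of a single EOS looks like inside a decoding loop. Everything in it (`generate`, `toy_next_token`, the token IDs) is hypothetical and only for illustration; a real framework would call the model where the toy function is.

```python
from typing import Callable, Iterable, List


def generate(
    next_token: Callable[[List[int]], int],
    prompt_ids: Iterable[int],
    stop_token_ids: Iterable[int],
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy-style loop that halts on any ID in stop_token_ids (hypothetical helper)."""
    stop = set(stop_token_ids)
    prompt = list(prompt_ids)
    output: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt + output)
        if token in stop:
            # Any of several stop IDs ends generation; no dedicated EOS is needed.
            break
        output.append(token)
    return output


def toy_next_token(ids: List[int]) -> int:
    """Stand-in for a model: returns the current sequence length as the next token."""
    return len(ids)


result = generate(toy_next_token, [0, 1, 2], stop_token_ids={6, 99})
print(result)  # [3, 4, 5]
```

The loop stops as soon as the toy model emits 6, which is in the stop set, so a tokenizer without an EOS token works fine as long as the caller can pass its own stop IDs.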

jklj077 changed discussion status to closed
