Qwen/Qwen-14B · It would be useful if someone spent the time to convert the tokenizer to gpt2 HF format

KnutJaegersberg

Qwen org Nov 24, 2023

https://twitter.com/Euclaise_/status/1727832132075094139

CyberTimon

Nov 27, 2023

I'm also interested in this

jklj077

Qwen org Dec 21, 2023

It is possible. vonjack has made one: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat

But the regex rule for pre-tokenization is different, so you will need to train the model to adapt.
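
To illustrate why a different pre-tokenization regex matters, here is a rough sketch. The two patterns are simplified ASCII-only stand-ins that I made up for illustration, not the actual rules of either tokenizer; they only mimic one known difference (digits kept as one run with a leading space vs. digits split into groups of up to three).

```python
import re

# Hypothetical, simplified stand-ins for two pre-tokenization rules.
# They are NOT the real patterns used by GPT-2 or by cl100k/Qwen.
GPT2_LIKE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")
CL100K_LIKE = re.compile(r" ?[A-Za-z]+|\d{1,3}| ?[^\sA-Za-z\d]+|\s+")


def pre_tokenize(pattern, text):
    """Split text into the chunks that BPE merges would later run inside."""
    return pattern.findall(text)


text = "in 2023"
gpt2_pieces = pre_tokenize(GPT2_LIKE, text)
cl100k_pieces = pre_tokenize(CL100K_LIKE, text)

print(gpt2_pieces)
print(cl100k_pieces)
```

BPE merges never cross chunk boundaries, so the same vocabulary applied after a different split yields different token IDs for the same text. That mismatch is what the model would have to be trained to adapt to.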

In fact, the tokenizer is extended from cl100k from tiktoken, which also doesn't have EOS (end of sentence). Also, tiktoken is faster than HF's FastTokenizer.

If the framework you use relies only on EOS to work and doesn't offer options such as stop/stop_token_ids/..., you'd better ditch it. It is not designed for generative language models.
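
For reference, a minimal sketch of what a stop_token_ids option amounts to inside a decoding loop. This is plain Python with no real framework; `generate`, `next_token`, and `toy_next_token` are hypothetical names for illustration only, and greedy one-token-at-a-time decoding is assumed.

```python
from typing import Callable, Iterable, List


def generate(
    next_token: Callable[[List[int]], int],
    prompt_ids: List[int],
    stop_token_ids: Iterable[int],
    max_new_tokens: int = 32,
) -> List[int]:
    """Decode until any ID in stop_token_ids is produced (hypothetical helper)."""
    stop = set(stop_token_ids)
    ids = list(prompt_ids)
    new_ids: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok in stop:
            # Stop on any configured ID; no single dedicated EOS token is required.
            break
        ids.append(tok)
        new_ids.append(tok)
    return new_ids


def toy_next_token(ids: List[int]) -> int:
    # Toy stand-in for a model: the next token ID is the current sequence length.
    return len(ids)


output = generate(toy_next_token, prompt_ids=[0, 1, 2], stop_token_ids={6})
print(output)
```

With an option like this, the caller chooses which special token IDs end a generation, so the tokenizer does not need to define an EOS token at all.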

jklj077 changed discussion status to closed Dec 21, 2023