Full 1-1 compat with dbrx-instruct tokenizer?

#1
by Qubitium - opened

@Xenova Is this a full 1:1 match? That is, do encode and decode (forward and reverse) behave exactly like the tiktoken tokenizer in dbrx-instruct? Thank you.

Owner

It should be! We've tested it on the entire xnli dataset (all languages), and it produces the same result.
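For anyone who wants to repeat this kind of check, here is a minimal sketch of a parity test. It is not the script that was actually used. `find_mismatches` and the toy byte-level tokenizers are hypothetical stand-ins so the example runs without downloads; in practice you would pass the `encode`/`decode` callables of the original tiktoken-based tokenizer and of the converted one, and feed in text from a multilingual corpus.

```python
from typing import Callable, Iterable, List, Tuple

Encode = Callable[[str], List[int]]
Decode = Callable[[List[int]], str]


def find_mismatches(
    reference: Tuple[Encode, Decode],
    candidate: Tuple[Encode, Decode],
    texts: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (text, stage) pairs for every input where the candidate diverges."""
    ref_encode, ref_decode = reference
    cand_encode, cand_decode = candidate
    mismatches = []
    for text in texts:
        ref_ids = ref_encode(text)
        if cand_encode(text) != ref_ids:
            # Forward direction differs: token ids are not identical.
            mismatches.append((text, "encode"))
        elif cand_decode(ref_ids) != ref_decode(ref_ids):
            # Reverse direction differs: same ids decode to different text.
            mismatches.append((text, "decode"))
    return mismatches


# Toy stand-ins (assumptions, not real tokenizers): UTF-8 byte-level encode/decode.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


def lowercasing_encode(text: str) -> List[int]:
    # Deliberately broken variant that lowercases before encoding.
    return list(text.lower().encode("utf-8"))


samples = ["Hello world", "héllo", "你好"]

# Identical tokenizers: no mismatches.
print(find_mismatches((byte_encode, byte_decode), (byte_encode, byte_decode), samples))
# → []

# Broken candidate: only the input containing an uppercase letter diverges.
print(find_mismatches((byte_encode, byte_decode), (lowercasing_encode, byte_decode), samples))
# → [('Hello world', 'encode')]
```

An empty result only shows agreement on the texts you tested, so the broader the corpus (languages, whitespace, special tokens), the more meaningful the check.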

@Xenova Thank you. We made a base tokenizer based on your code, with slight modifications to include a pad token and the tokens that the original tokenizer added dynamically but never added to the encoder. https://huggingface.co/LnL-AI/dbrx-base-tokenizer
