lack of digit splitting in slow version of tokenizer

#1
by sanderland - opened

from transformers import AutoTokenizer

slowtok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=False)
fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)

slowtok.encode('351')
# [18113]

fasttok.encode('351')
# [18, 20, 16]

Stability AI org

Thanks for reporting this, @sanderland! Looks like GPT2Tokenizer (slow) doesn't apply the pre-tokenization split rule defined in tokenizer.json. I've updated the tokenizer config to force the fast class as a temporary fix.
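For anyone wanting to verify the difference, here's a minimal sketch (assuming the transformers and tokenizers libraries) that inspects the pre-tokenization step the fast tokenizer applies; the slow GPT2Tokenizer skips this step, which is why '351' stays as a single merged token there.

from transformers import AutoTokenizer

fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)

# The fast tokenizer wraps a tokenizers.Tokenizer, whose pre_tokenizer is built
# from the rules in tokenizer.json (including the digit split).
print(fasttok.backend_tokenizer.pre_tokenizer.pre_tokenize_str('351'))
# Expected (assumption): each digit as its own pre-token, e.g.
# [('3', (0, 1)), ('5', (1, 2)), ('1', (2, 3))]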
