lack of digit splitting in slow version of tokenizer
#1 by sanderland - opened
from transformers import AutoTokenizer

slowtok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=False)
fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)

slowtok.encode('351')  # [18113] - the three digits are merged into a single token
fasttok.encode('351')  # [18, 20, 16] - each digit becomes its own token
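For reference, the digit splitting can also be seen directly at the pre-tokenization stage of the fast tokenizer. This is just a sketch; the output shown in the comment is an assumption inferred from the [18, 20, 16] encoding above:

from transformers import AutoTokenizer

fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)
# backend_tokenizer exposes the underlying Rust tokenizer; its pre_tokenizer
# applies the split rules from tokenizer.json before BPE merges run
print(fasttok.backend_tokenizer.pre_tokenizer.pre_tokenize_str('351'))
# assumed output: individual digits, e.g. [('3', (0, 1)), ('5', (1, 2)), ('1', (2, 3))]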
Thanks for reporting this, @sanderland! Looks like GPT2Tokenizer (slow) doesn't apply the pre-tokenization split rule defined in tokenizer.json. I've updated the tokenizer config to force the fast class as a temporary fix.
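A quick sanity check for downstream users after the config update might look like the sketch below; the expected values in the comments are assumptions based on the fast-tokenizer output above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b')  # default load, no use_fast flag
print(tok.is_fast)        # True means the fast (Rust-backed) tokenizer is being used
print(tok.encode('351'))  # expected to match the fast result above: [18, 20, 16]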