lack of digit splitting in slow version of tokenizer
#1 by sanderland - opened
from transformers import AutoTokenizer

slowtok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=False)
fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)

slowtok.encode('351')  # [18113] - the three digits are merged into a single token
fasttok.encode('351')  # [18, 20, 16] - each digit becomes its own token
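For reference, the digit splitting can also be seen directly at the pre-tokenization stage of the fast tokenizer. This is just a sketch; the output shown in the comment is an assumption inferred from the [18, 20, 16] encoding above:

from transformers import AutoTokenizer

fasttok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b', use_fast=True)
# backend_tokenizer exposes the underlying Rust tokenizer; its pre_tokenizer
# applies the split rules from tokenizer.json before BPE merges run
print(fasttok.backend_tokenizer.pre_tokenizer.pre_tokenize_str('351'))
# assumed output: individual digits, e.g. [('3', (0, 1)), ('5', (1, 2)), ('1', (2, 3))]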
Thanks for reporting this, @sanderland! Looks like GPT2Tokenizer (slow) doesn't apply the pre-tokenization split rule defined in tokenizer.json. I've updated the tokenizer config to force the fast class as a temporary fix.
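A quick sanity check for downstream users after the config update might look like the sketch below; the expected values in the comments are assumptions based on the fast-tokenizer output above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-12b')  # default load, no use_fast flag
print(tok.is_fast)        # True means the fast (Rust-backed) tokenizer is being used
print(tok.encode('351'))  # expected to match the fast result above: [18, 20, 16]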