BLOOM's tokenizer vocab looks like garbled text

by ShaneSue - opened

Does anyone know how to fix it?

BigScience Workshop org

The tokenizer operates on bytes rather than characters, so it's normal for tokens in the vocab to contain weird-looking characters, especially for non-ASCII text. If your goal is to manually inspect individual tokens, you can convert them back to readable strings using the tokenizer's convert_tokens_to_string method.
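To illustrate why byte-level tokens look garbled, here is a minimal sketch that needs no model download. It assumes BLOOM's vocab uses the same byte-to-character table as GPT-2 style byte-level BPE, which I have not verified against the BLOOM tokenizer files. The helper names (`bytes_to_unicode`, `text_to_token_chars`, `tokens_to_string`) are illustrative and not part of any library API.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each of the 256 byte values
    to a printable character (assumed to match BLOOM's vocab encoding)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every other byte (space, control bytes, ...) gets a code point >= 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def text_to_token_chars(text):
    """Show how text looks once its UTF-8 bytes are written with the table."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def tokens_to_string(tokens):
    """Reverse the mapping: token characters -> raw bytes -> UTF-8 text."""
    raw = bytes(CHAR_TO_BYTE[ch] for tok in tokens for ch in tok)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(text_to_token_chars(" hello"))      # Ġhello
    print(text_to_token_chars("你好"))         # ä½łå¥½
    print(tokens_to_string(["ä½ł", "å¥½"]))   # 你好
```

With the transformers library, the equivalent check would be to pass the output of `tokenizer.tokenize(text)` to `tokenizer.convert_tokens_to_string`, as suggested above.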

I got it, thanks a lot!

christopher changed discussion status to closed
