Tokenization/decoding bug with "_"?

#13
by anttip - opened

The model sometimes outputs broken syntax: segments starting with "_" are replaced with special tokens. I first noticed this with a sequence containing "_to". Now I get this generated:

df = pd.read<s> = pd.read_csv(

This happens with both transformers 4.33.1 and the current transformers-4.34.0.dev0.
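
For anyone trying to reproduce this, a quick way to flag affected generations is to scan the decoded text for special-token strings. This is only a rough sketch: the token list is an assumption based on the `<s>` seen above, and the helper name is made up for illustration, so adjust both to the tokenizer in use.

```python
import re

# Assumed special-token strings; replace with the ones your tokenizer defines.
SPECIAL_TOKENS = ("<s>", "</s>", "<unk>")


def find_stray_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Return (offset, token) pairs for every special token found in text."""
    pattern = re.compile("|".join(re.escape(t) for t in special_tokens))
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


# The broken generation quoted above.
sample = "df = pd.read<s> = pd.read_csv("
hits = find_stray_special_tokens(sample)
print(hits)
```

An empty list means no special-token strings leaked into the decoded output.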

The issue could be with bitsandbytes quantization: adding bnb_4bit_quant_type="nf4" to the bitsandbytes config fixes the example above.
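
Roughly what that config change looks like, as a sketch rather than my exact script. `build_quant_kwargs` and `load_quantized_model` are hypothetical helper names, `model_id` is whatever checkpoint you are loading, and `device_map="auto"` is an assumption about the setup. If `bnb_4bit_quant_type` is not set, `BitsAndBytesConfig` uses "fp4".

```python
def build_quant_kwargs(quant_type="nf4"):
    """Keyword arguments for BitsAndBytesConfig with an explicit 4-bit quant type."""
    return {
        "load_in_4bit": True,
        # Left unset, this falls back to "fp4".
        "bnb_4bit_quant_type": quant_type,
    }


def load_quantized_model(model_id):
    """Load a causal LM in 4-bit; needs transformers, accelerate and bitsandbytes."""
    # Imported here so the helper above can be used without these packages.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**build_quant_kwargs())
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # assumption: let accelerate place the weights
    )
```

Calling `load_quantized_model(model_id)` and regenerating the same prompt should show whether the stray `<s>` goes away with nf4.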
