
Update eos_token_id / bos_token_id in config.json

#19
by nickhugs - opened

The current values don't even appear to be in the vocab. This change makes config.json match the recent updates to the tokenizer config.
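
A quick way to see the mismatch (a sketch, not part of this PR; trust_remote_code is needed because santacoder ships custom model code):

from transformers import AutoConfig, AutoTokenizer

checkpoint = "bigcode/santacoder"
config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The ids in config.json should refer to real vocab entries;
# before this fix they pointed past the end of the vocab.
print("config eos/bos:", config.eos_token_id, config.bos_token_id)
print("tokenizer eos/bos:", tokenizer.eos_token_id, tokenizer.bos_token_id)
print("vocab size:", len(tokenizer))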

I can confirm that this causes bugs. To reproduce, take the example from the README and pass min_new_tokens to generate:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, min_new_tokens=5)  # <== change

# raises IndexError: index 50256 is out of bounds for dimension 1 with size 49280

Passing min_new_tokens results in the logits_processor containing a MinNewTokensLengthLogitsProcessor built with the incorrect eos_token_id, which causes the IndexError. Other generation arguments may have similar effects; I haven't tested them.
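
For illustration, a minimal sketch of where the failure surfaces, using the processor directly instead of generate (behavior as of transformers versions around 4.26; the vocab size and stale eos id are taken from the traceback above, and newer releases may mask the EOS logit differently):

import torch
from transformers import MinNewTokensLengthLogitsProcessor

vocab_size = 49280        # santacoder's output dimension
bad_eos_token_id = 50256  # GPT-2's eos id, left over in the old config.json

processor = MinNewTokensLengthLogitsProcessor(
    prompt_length_to_skip=1, min_new_tokens=5, eos_token_id=bad_eos_token_id
)

input_ids = torch.tensor([[0]])      # one prompt token, zero new tokens so far
scores = torch.zeros(1, vocab_size)  # fake logits for one generation step

# Until min_new_tokens have been generated, the processor masks the EOS logit
# with -inf; indexing column 50256 of a 49280-wide tensor raises the IndexError.
processor(input_ids, scores)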

BigCode org

thanks for the fix!

loubnabnl changed pull request status to merged
