added_tokens_decoder Seems to Cause Index Errors

#1 by SeaZeeHech

I think <|im_start|> and <|im_end|> were added after another user mentioned they weren't natively in the model vocab, but the current config throws indexing errors whenever either token is used. The code below, which bumps the vocab size, makes it run without error, though the loss is very high (~10 where other models get ~3 on my data), presumably because the embeddings for the new tokens are just untrained noise. Still, the indexing errors are gone. I'm also not sure how to remove the tokens from the tokenizer once it's instantiated.

Seems like these tokens shouldn't be added, at least for now, if the model wasn't trained with them and hasn't learned them? Am I missing something? Does it work out of the box for others?

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'leveldevai/MarcBeagle-7B'

# Bump the vocab size so the two added tokens have embedding rows
config = AutoConfig.from_pretrained(model_name)
config.vocab_size += 2

generator = AutoModelForCausalLM.from_pretrained(
    model_name, config=config, ignore_mismatched_sizes=True,
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',
    trust_remote_code=True,
)
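For reference, the more standard way to keep the embedding matrix in sync with the tokenizer seems to be resize_token_embeddings rather than hand-editing the config. A minimal sketch (only the model name comes from this thread, the rest is stock transformers API; the new rows are still randomly initialized, so I'd expect the loss to stay high either way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'leveldevai/MarcBeagle-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Inspect the tokens registered on top of the base vocab (id -> AddedToken)
print(tokenizer.added_tokens_decoder)

# Grow the embedding matrix to cover every tokenizer id; the rows for the
# added tokens are freshly initialized, i.e. untrained
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)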

Thanks for noticing.
This file seems to come from one of the models used in the merge. I updated a few things and it appears to be working well for me; please let me know if you see anything else.
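If anyone wants to double-check the updated files, a quick sanity check along these lines (a sketch using only standard transformers calls) should confirm the tokenizer and model now agree and that the chat tokens map to in-range ids:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'leveldevai/MarcBeagle-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Every tokenizer id must fall inside the embedding matrix
vocab_rows = model.get_input_embeddings().weight.shape[0]
assert len(tokenizer) <= vocab_rows, (len(tokenizer), vocab_rows)

# The chat-template tokens should encode to in-range ids and round-trip cleanly
ids = tokenizer.encode('<|im_start|>user<|im_end|>', add_special_tokens=False)
assert max(ids) < vocab_rows, ids
print(ids, tokenizer.decode(ids))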
