Tokenizer doesn't add EOS token even when explicitly requested

#90
by johngiorgi - opened

I can't seem to get the tokenizer to add the EOS token, even when I explicitly request it. Reproduction below with a fresh download of the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    add_eos_token=True,
    force_download=True, 
    token=True
)
tokenizer.tokenize("this is a test", add_special_tokens=True)
>>> ['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']

I am on transformers==4.40.1 and tokenizers==0.19.1
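As far as I can tell, fast tokenizers append special tokens via their post-processor template, not via the `add_eos_token` kwarg, so the flag is silently ignored when the shipped template has no EOS slot (the Llama 3 template appears to include `<|begin_of_text|>` only). A minimal offline sketch with a toy vocab (the vocab, token names, and ids here are illustrative, not the real Llama 3 ones):

```python
# Sketch of why add_eos_token can be ignored: with fast tokenizers, special
# tokens come from the post-processor template. Toy vocab so it runs offline.
from tokenizers import Tokenizer, models, pre_tokenizers, processors

vocab = {"<bos>": 0, "<eos>": 1, "this": 2, "is": 3, "a": 4, "test": 5}
tok = Tokenizer(models.WordLevel(vocab, unk_token="test"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

# Template without EOS: no kwarg can make an EOS token appear.
tok.post_processor = processors.TemplateProcessing(
    single="<bos> $A",
    special_tokens=[("<bos>", 0)],
)
print(tok.encode("this is a test").tokens)
# ['<bos>', 'this', 'is', 'a', 'test']

# Template with an EOS slot: the token is appended on every encode.
tok.post_processor = processors.TemplateProcessing(
    single="<bos> $A <eos>",
    special_tokens=[("<bos>", 0), ("<eos>", 1)],
)
print(tok.encode("this is a test").tokens)
# ['<bos>', 'this', 'is', 'a', 'test', '<eos>']
```

So a template that actually names the EOS token is what makes it show up, which would explain why `add_eos_token=True` does nothing here.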

This is awful. I hope it gets resolved soon!

Try this:

import os

import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer

# `model` and `Settings.CACHE_DIR` are defined elsewhere in my setup
tokenizer = AutoTokenizer.from_pretrained(
    model, token=os.environ["HUGGING_FACE_TOKEN"], cache_dir=Settings.CACHE_DIR,
)
stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

llm = HuggingFaceLLM(
    model_name=model,
    model_kwargs={
        "token": os.environ["HUGGING_FACE_TOKEN"],
        "torch_dtype": torch.bfloat16,
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.01,
        "top_p": 0.9,
    },
    tokenizer_name=model,
    tokenizer_kwargs={"token": os.environ["HUGGING_FACE_TOKEN"]},
    stopping_ids=stopping_ids,
)
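For anyone not using LlamaIndex: `stopping_ids` just makes generation halt on those token ids. The same effect can be sketched as a post-hoc truncation with a hypothetical helper (not a library API; 128001 and 128009 are, as I understand it, the Llama 3 ids for `<|end_of_text|>` and `<|eot_id|>`):

```python
# Hypothetical helper mirroring what stopping_ids achieves: cut the
# generated id sequence at the first stop token it contains.
def truncate_at_stop(ids, stopping_ids):
    for i, tok_id in enumerate(ids):
        if tok_id in stopping_ids:
            return ids[:i]
    return ids  # no stop token seen; keep everything

# e.g. <|end_of_text|> = 128001, <|eot_id|> = 128009 in the Llama 3 vocab
print(truncate_at_stop([791, 220, 128009, 50], {128001, 128009}))
# [791, 220]
```

With `model.generate` in plain transformers you can pass a list to `eos_token_id` for the same purpose.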
osanseviero changed discussion status to closed

@osanseviero Why was this closed? And with zero explanation? This is still an issue with both the base model and the instruct model:

['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']
