BUG: Using `AutoTokenizer.from_pretrained`'s `.encode()` function fails to add BOS token

#21
by m18coppola - opened

The Llama-3 tokenizer's `.encode()` function adds a BOS token, but the Llama-3.1 tokenizer's `.encode()` function does not. Is this intended behavior?

Example:

```python
from transformers import AutoTokenizer

llama_3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_3_1_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

text = "Hello"

print(llama_3_tokenizer.encode(text))
print(llama_3_1_tokenizer.encode(text))
```

Output:

```
[128000, 9906]
[9906]
```
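Until the tokenizer configs are aligned, one way to make the two encodings consistent is to normalize them yourself. A minimal sketch of such a workaround, assuming the model's BOS token id (128000 for the Llama-3 family, as seen in the output above) is available via `tokenizer.bos_token_id`; the `ensure_bos` helper here is hypothetical, not part of `transformers`:

```python
def ensure_bos(token_ids, bos_token_id):
    """Prepend the BOS token id if the encoding does not already start with it.

    This normalizes outputs across tokenizers that may or may not add
    BOS automatically in .encode().
    """
    if bos_token_id is not None and (not token_ids or token_ids[0] != bos_token_id):
        return [bos_token_id] + list(token_ids)
    return list(token_ids)


# Using the ids from the report above (BOS id 128000, "Hello" -> 9906):
print(ensure_bos([9906], 128000))          # Llama-3.1-style output, BOS added
print(ensure_bos([128000, 9906], 128000))  # Llama-3-style output, unchanged
```

In practice you would call it as `ensure_bos(tokenizer.encode(text), tokenizer.bos_token_id)` so both tokenizers yield the same ids.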
m18coppola changed discussion status to closed
