Fix tokenizer_config to add bos token

#6
by eduagarcia - opened

As discussed in the meta-llama/Meta-Llama-3-8B repository, the current HF tokenizer does not prepend the bos token (id: 128000) the way the reference implementation does:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/generation.py#L256

and in their test cases:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L23

This commit changes the tokenizer_class from "PreTrainedTokenizerFast" to "LlamaTokenizer", since PreTrainedTokenizerFast does not seem to support the add_bos_token flag.
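For reference, this is roughly the change this PR makes to tokenizer_config.json (a sketch showing only the relevant keys; the surrounding entries in the actual file are omitted):

-  "tokenizer_class": "PreTrainedTokenizerFast"
+  "tokenizer_class": "LlamaTokenizer",
+  "add_bos_token": true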

Before the fix, the HF tokenizer's output is missing the leading bos token:

!git clone https://github.com/meta-llama/llama3.git

from llama3.llama import Tokenizer
from transformers import AutoTokenizer

# Reference tokenizer from the official repo vs. the current HF tokenizer
llama_tokenizer = Tokenizer("llama3/Meta-Llama-3-70B/tokenizer.model")
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

text = "This is a test sentence"

orig_enc = llama_tokenizer.encode(text, bos=True, eos=False)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
hf_enc = hf_tokenizer.encode(text)
# [2028, 374, 264, 720, 1296, 271, 52989]  (no leading 128000)
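Until a fix is merged, one workaround (a sketch, assuming the snippet above has already run and that bos_token is set in the config, which it is here) is to prepend the bos id manually:

# Manually prepend the bos token id to match the reference output
hf_enc_fixed = [hf_tokenizer.bos_token_id] + hf_tokenizer.encode(text)
assert hf_enc_fixed == orig_enc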

After the fix, the bos token is prepended as expected:

from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B", revision="refs/pr/6")

text = "This is a test sentence"

hf_enc = hf_tokenizer.encode(text)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
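A quick parity check (assuming llama_tokenizer and orig_enc from the earlier snippet are still in scope):

assert hf_enc[0] == 128000  # bos token is now prepended
assert hf_enc == orig_enc   # matches the reference implementation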
pcuenq (Meta Llama org)

Hi @eduagarcia, thanks a lot for this! We just merged #8, which doesn't require changing the tokenizer class and is the solution recommended by @ArthurZ. Changing the class may have unexpected consequences, as the Llama 3 tokenizer is very different from the one used in Llama 2. Closing these PRs now, feel free to open new discussions if you observe any issues :)
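For context, #8 keeps PreTrainedTokenizerFast and instead makes the fast tokenizer itself prepend bos via its post-processor. A minimal sketch of the same idea using the tokenizers library (the exact contents of the merged tokenizer.json differ, e.g. it composes this with the existing ByteLevel post-processor):

from tokenizers import processors
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

# Attach a TemplateProcessing post-processor that inserts <|begin_of_text|>
# (id 128000) in front of every encoded sequence
hf_tokenizer.backend_tokenizer.post_processor = processors.TemplateProcessing(
    single="<|begin_of_text|> $A",
    pair="<|begin_of_text|> $A <|begin_of_text|> $B",
    special_tokens=[("<|begin_of_text|>", 128000)],
)

hf_tokenizer.encode("This is a test sentence")
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]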

Thanks again for your contribution! 🙌

pcuenq changed pull request status to closed
