Fix tokenizer_config to add bos token

#33
by eduagarcia - opened

As discussed in #9, the current HF tokenizer does not prepend the bos token (id: 128000) like in the reference implementation:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/generation.py#L256

and in their test cases:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L23

This commit changes the tokenizer_class "PreTrainedTokenizerFast" to the "LlamaTokenizer", the PreTrainedTokenizerFast doesn't support seem to support the add_bos_token flag.

before the fix:

!git clone https://github.com/meta-llama/llama3.git

from llama3.llama import Tokenizer
from transformers import AutoTokenizer
llama_tokenizer = Tokenizer("llama3/Meta-Llama-3-8B/tokenizer.model")
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "This is a test sentence"

orig_enc = llama_tokenizer.encode(text, bos=True, eos=False)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
hf_enc = hf_tokenizer.encode(text)
# [2028, 374, 264, 720, 1296, 271, 52989]

after the fix:

from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", revision="refs/pr/33")

text = "This is a test sentence"

hf_enc = hf_tokenizer.encode(text)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
eduagarcia changed pull request status to closed

Sign up or log in to comment