No BOS token added: upstream fix not implemented

#2 opened by riskybiscuit

There was an issue with the Llama 3 tokenizer where it didn't automatically add the BOS token. This has since been fixed in the original Meta repository, but the fix hasn't been applied to this repository yet.

Example:

from transformers import AutoTokenizer

tokenizer_astronomer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Special-Tokens-Adjusted")
tokenizer_meta = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=huggingface_access_token)

print(tokenizer_astronomer("Hi")["input_ids"])
> [13347]
print(tokenizer_meta("Hi")["input_ids"])
> [128000, 13347]

128000 is Llama 3's BOS token, <|begin_of_text|>.
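
Until a fix lands, a workaround is to prepend the BOS id manually. This is a minimal sketch using standard transformers tokenizer attributes (bos_token_id, add_special_tokens); it assumes the BOS token is still registered in the tokenizer config, as it appears to be here, and only the post-processing step is missing:

# The BOS token itself is registered; only the post-processor fails to add it
print(tokenizer_astronomer.bos_token, tokenizer_astronomer.bos_token_id)
> <|begin_of_text|> 128000

# Workaround: prepend the BOS id by hand
input_ids = [tokenizer_astronomer.bos_token_id] + tokenizer_astronomer("Hi", add_special_tokens=False)["input_ids"]
print(input_ids)
> [128000, 13347]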

If possible, it would be amazing if you could incorporate the upstream fixes into this repo.

The relevant discussions:

https://huggingface.co/meta-llama/Meta-Llama-3-70B/discussions/6
https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/9
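
For context, the upstream change edits the post_processor section of tokenizer.json so that <|begin_of_text|> is prepended during encoding. Roughly, the in-memory equivalent using the tokenizers TemplateProcessing API looks like this (a sketch only, not the exact diff, which also keeps a ByteLevel post-processing step):

from tokenizers.processors import TemplateProcessing

# Replace the post-processor with a template that prepends the BOS token
tokenizer_astronomer.backend_tokenizer.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $A",
    special_tokens=[("<|begin_of_text|>", 128000)],
)
print(tokenizer_astronomer("Hi")["input_ids"])
> [128000, 13347]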

Astronomer org

Thanks for bringing this up. I just integrated the changes to the tokenizer.json file for both the special-tokens-adjusted 70B and 8B models (based on this change to the Llama 3 repo: https://huggingface.co/meta-llama/Meta-Llama-3-70B/discussions/8/files).

Please let me know if you run into any issues with this.
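
If anyone still sees the old behaviour after the update, a stale cached copy may be the culprit. Something like the following should pull the updated files; force_download is a standard from_pretrained option, nothing specific to this repo:

from transformers import AutoTokenizer

# Bypass the local cache so the updated tokenizer.json is fetched
tokenizer = AutoTokenizer.from_pretrained(
    "astronomer/Llama-3-8B-Special-Tokens-Adjusted",
    force_download=True,
)
print(tokenizer("Hi")["input_ids"])
> [128000, 13347]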

Great, thanks so much! Using the latest version of this repo, I now get the expected behaviour for the tokenizer:

tokenizer("Hi")
> {'input_ids': [128000, 13347], 'attention_mask': [1, 1]}
riskybiscuit changed discussion status to closed
