Update post-processor to add bos

#41
opened by pcuenq
Meta Llama org
No description provided.
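For anyone wondering what "add bos via the post-processor" looks like in practice: with the `tokenizers` library, prepending BOS is typically done with a `TemplateProcessing` post-processor. A minimal sketch with a made-up toy vocab and ids, using the Llama 3 BOS string `<|begin_of_text|>` (the real tokenizer.json is much larger, but the post-processor shape is the same idea):

```python
from tokenizers import Tokenizer, models, processors

# Toy two-token vocab purely for illustration; ids are made up.
vocab = {"<|begin_of_text|>": 0, "hello": 1}
tok = Tokenizer(models.WordLevel(vocab, unk_token="hello"))

# Post-processor that prepends BOS to every single-sequence encoding --
# the kind of change this PR makes in tokenizer.json.
tok.post_processor = processors.TemplateProcessing(
    single="<|begin_of_text|> $A",
    special_tokens=[("<|begin_of_text|>", 0)],
)

print(tok.encode("hello").ids)  # -> [0, 1], BOS now prepended
```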

@pcuenq It looks like this might be missing the end of turn token:

Edit: I'm dumb, this comment belongs on the instruct model https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/42/files Sorry!

https://github.com/meta-llama/llama3/blob/main/llama/generation.py#L307
->
https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L222

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

If I'm reading this right, the `<|eot_id|>` token should be appended at the end of each message.
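The overall flow in the linked reference code can be sketched in plain Python. This is a simplified stand-in, not the real implementation: `encode_text` substitutes for the actual BPE `encode(bos=False, eos=False)`, the real `encode_header` also encodes a trailing `"\n\n"`, and the special token ids below are the Llama 3 ones as far as I can tell:

```python
from typing import Dict, List

# Llama 3 special token ids (assumed from the released vocab).
SPECIAL = {
    "<|begin_of_text|>": 128000,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}

def encode_text(s: str) -> List[int]:
    # Stand-in for the real BPE encode(bos=False, eos=False).
    return [ord(c) for c in s]

def encode_message(message: Dict[str, str]) -> List[int]:
    # Header, then content, then eot -- eot closes every message.
    tokens = [SPECIAL["<|start_header_id|>"]]
    tokens += encode_text(message["role"])
    tokens.append(SPECIAL["<|end_header_id|>"])
    tokens += encode_text(message["content"].strip())
    tokens.append(SPECIAL["<|eot_id|>"])
    return tokens

def encode_dialog_prompt(dialog: List[Dict[str, str]]) -> List[int]:
    tokens = [SPECIAL["<|begin_of_text|>"]]  # BOS exactly once per dialog
    for message in dialog:
        tokens += encode_message(message)
    return tokens
```

The key point for this PR is the first line of `encode_dialog_prompt`: BOS is added once at the start, which is what the post-processor change replicates on the Hub tokenizer.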

@pcuenq Would adding add_bos_token: true in tokenizer_config.json do the trick?

I tested this change and it fixes fine-tuning of the base model: without it, the grad norm is inf and the loss is high.
I also tried just setting add_bos_token: true, and that did not actually add the token, at least with Axolotl.

This fixes the BOS token not being added within Axolotl.

Axolotl config

    datasets:
      - path: PJMixers/example-sharegpt
        type: sharegpt
        conversation: chatml

Without the PR

(screenshot of training metrics)

With the PR

(screenshot of training metrics)

Meta Llama org

Thanks for the confirmations, merging now!

pcuenq changed pull request status to merged
