Configuration confusion

#5
by krao - opened

I am confused about the correct EOS/BOS.

In generation_config.json and config.json the settings differ:

the latter, config.json:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 7168,
  "initializer_range": 0.02,
  "intermediate_size": 20480,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 56,
  "num_hidden_layers": 60,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 5000000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.35.0",
  "use_cache": true,
  "vocab_size": 64000
}

the former, generation_config.json:

{
  "bos_token_id": 6,
  "do_sample": true,
  "eos_token_id": 7,
  "pad_token_id": 0,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.8,
  "transformers_version": "4.35.0"
}

If I understand https://github.com/huggingface/transformers/issues/25395#issuecomment-1677796723 correctly, the values in config.json are only a fallback and generation_config.json takes precedence during generation.
That would give

  "bos_token_id": 6,
  "eos_token_id": 7,

However, maybe they are not added at all, as in tokenizer_config.json we have:

{
  "add_bos_token": false,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|startoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<|im_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|im_sep|>"
  ],
  "bos_token": "<|startoftext|>",
  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "legacy": true,
  "model_max_length": 4096,
  "pad_token": "<unk>",
  "padding_side": "right",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": true
}

note the:

  "add_bos_token": false,
  "add_eos_token": false,

So why are bos_token and eos_token still specified here?
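
For reference, a quick check of what those two flags do in practice (a sketch; MODEL_ID is again a placeholder for this repo):

from transformers import AutoTokenizer

MODEL_ID = "path/to/this/repo"  # placeholder

tok = AutoTokenizer.from_pretrained(MODEL_ID)
ids = tok("hello world")["input_ids"]
print(ids)  # no leading 1 or trailing 2, because add_bos_token/add_eos_token are false
print(tok.bos_token_id, tok.eos_token_id)  # 1 2 -- still defined, just not added automatically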

Hi @krao
The generate function will use the eos token as the stop token. In our chat template format (which is ChatML), we use "<|im_end|>" as the end token of the response. So we changed the eos_token_id in generation_config to 7, which maps to "<|im_end|>".
And yes, the bos_token_id should have no effect here.
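
For illustration, a minimal generation sketch of what that means in practice (a sketch rather than an official snippet; MODEL_ID is a placeholder for this repo):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this/repo"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# generate() picks up eos_token_id=7 ("<|im_end|>") from generation_config.json,
# so decoding stops at the end of the assistant's turn.
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))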

> And yes, the bos_token_id should have no effect here.

Do you mean bos_token and eos_token?

The chat model is developed on top of the base model, and the two use distinct training templates:

  • base model: Typically trained with a template such as "{document}<|endoftext|>". To format this appropriately, one can use tokenizer.encode(document, add_bos_token=add_bos_token, add_eos_token=add_eos_token) and designate "<|endoftext|>" as the stop token during generation.
  • chat model: Often trained using a template such as "<|im_start|>...<|im_end|>". For proper formatting, one can use tokenizer.apply_chat_template(messages) and designate "<|im_end|>" as the stop token during generation (see the sketch after this list).
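
A rough sketch of the two formatting paths (again just a sketch, with MODEL_ID as a placeholder for this repo):

from transformers import AutoTokenizer

MODEL_ID = "path/to/this/repo"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# base-model style: append the end-of-text marker yourself (add_bos_token/add_eos_token
# are false in tokenizer_config.json) and stop on "<|endoftext|>" (id 2) when generating
doc_ids = tokenizer.encode("Some pretraining-style document." + "<|endoftext|>")

# chat-model style: the ChatML template wraps each turn in <|im_start|>/<|im_end|>,
# and generation stops on "<|im_end|>" (id 7)
messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant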

It's important to note that the bos and eos settings found in config.json and tokenizer_config.json are inherited from the base model. However, the settings in generation_config.json are specifically defined by the chat model.

If you have any further questions, feel free to ask!

Thank you, that was really helpful!

krao changed discussion status to closed

Hi, can I just modify the bos and eos settings in config.json and tokenizer_config.json to make them align with generation_config.json? If yes, I can make a PR.
