Some suggestions for chat_template

#2
by Liangmingxin - opened

Hello! Could you please add the "chat_template": "{{ bos_token }}..." field to tokenizer_config.json? I'm currently serving this model with vllm, and because of vllm's built-in chat_template mechanism, when this field is missing vllm falls back to its own default chat_template. That leads to a very bad problem where the model output looks something like the following:

Human
There is a cat and a chicken in the box. How many feet do these two animals have?
AI
To calculate the number of feet these two animals have, we consider each animal's typical limb count. A cat has four legs, and a chicken also has two legs. When we combine their limbs, there are 4 (cat's legs) + 2 (chicken's legs) = 6 feet in total from both animals. [INST]<<SYS>>In this given situation with a cat and a chicken inside a box, we determine the total number of feet they possess. Each animal contributes differently to this figure, as cats have four legs and chickens have two legs. As we combine these leg counts, the total number of feet comes to 6 feet - a sum of 4 feet from the cat and 2 feet from the chicken.<<SYS>>]

As you can see, the output is littered with [INST] and similar lexical elements, which makes for a poor user experience. I have tried the gguf model you deployed and it works fantastically, thank you very much for your contribution!

Alternatively, could you provide a deployment guide for vllm? A set of github repositories for easy deployment, similar to openchat-3.5, would be a great help in making your model available to a wider audience! Thanks again!

Hi, thanks for your interest! Can you provide an example of a model with a tokenizer_config.json that works with vllm, so I can copy it?

Thanks, done!

It doesn't work in my case with vllm.
The model does not seem to generate the eos token (</s>) even if I manually add it to the prompt template.

I'm mainly focused on deployment development, so maybe I'm wrong, but shouldn't the eos token be added in the training data right before <|im_end|>?

https://huggingface.co/datasets/mlabonne/chatml_dpo_pairs

Hi, thanks! I can look into the issue when I have more time, but <|im_end|> is the EOS token. See my code in Phixtral Chat:
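
Something along these lines; this is a simplified sketch of the idea rather than the exact Space code, and the model id is just an example of a ChatML-tuned model that registers <|im_end|> as a special token:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: stop generation on <|im_end|> instead of the default </s>.
# Model id is an assumption; any ChatML model with <|im_end|> in its vocab works.
model_id = "mlabonne/NeuralHermes-2.5-Mistral-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Treat <|im_end|> as the stop token for this generation call.
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=im_end_id)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))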

Ok, so the problem is that vllm uses the transformers tokenizer, and the stop condition is met when the last generated token matches tokenizer.eos_token_id (id=2).

And <|im_end|> is encoded as [ 523, 28766, 321, 28730, 416, 28766, 28767 ]

vLLM supports a non-OpenAI-standard field in its requests (stop_token_ids), and if I pass the list above, the model stops generating almost correctly.
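
For reference, roughly what I mean; the URL, port and model path come from my local test setup, so adjust them for yours:

import requests

payload = {
    "model": "./NeuralMarcoro14-7B",
    "prompt": "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
    "max_tokens": 256,
    # Non-standard vLLM field: generation stops as soon as one of these ids appears.
    "stop_token_ids": [523, 28766, 321, 28730, 416, 28766, 28767],
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.json()["choices"][0]["text"])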

It's not really good, because the stop condition is triggered as soon as any single token id from the list is matched, not the full sequence, so all of these tokens become prohibited separately.
If you're curious, you can look at the _check_stop function in vllm/engine/llm_engine.py

Plus, the token id for "<" is not always 523; it depends on the preceding character.

So in my opinion, the stop token should always be </s> (which, by the way, can be encoded as a single, unambiguous token).
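
To make that concrete, here is the kind of check I did with the transformers tokenizer (the local path is just where I cloned the model; the printed ids are from my own test):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./NeuralMarcoro14-7B")

print(tok.eos_token, tok.eos_token_id)                     # </s> 2
print(tok.encode("</s>", add_special_tokens=False))        # expected: [2], a single id
print(tok.encode("<|im_end|>", add_special_tokens=False))  # several ids, e.g. [523, 28766, 321, 28730, 416, 28766, 28767]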

P.S.: I'm French and I'm not sure my explanations are clear; I can explain again if needed ;)
Once again, I don't know much about training, but I know vLLM well.

I'm Chinese and my English isn't great either, hahaha.

Liangmingxin changed discussion status to closed

When I serve this model with vllm, even after updating the chat_template, the model often fails to stop after answering the question reasonably well (in Chinese Q&A), producing a lot of redundant INSTRUCTIONS and so on. This problem doesn't occur with the GGUF page you provided, so I'm a bit confused. I initially thought it was a chat_template issue; after switching --chat-template to vllm-project/vllm/blob/main/examples/template_alpaca.jinja the problem got a little better, but the quality of the answers went down. Maybe it's not a chat_template problem at all; maybe the model was trained on a lot of non-uniform data formats, which makes its answer format unstable? What do you think? Do you have any further suggestions for deploying this model with vllm? Guidance is much appreciated!
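
To compare templates offline before restarting the server, I render them locally with jinja2. This is just a sketch: the ChatML string below is the one from tokenizer_config.json, and you can paste the contents of template_alpaca.jinja in its place to see the difference:

from jinja2 import Template

# Render the chat template locally to inspect the exact prompt vLLM would build.
chatml = (
    "{% for message in messages %}"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
messages = [
    {"role": "user", "content": "There is a cat and a chicken in the box. How many feet do these two animals have?"},
]
print(Template(chatml).render(messages=messages, add_generation_prompt=True))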

Liangmingxin changed discussion status to open

In order to better debug and resolve this issue, I deployed a login-free page for testing purposes only.
Hardware: RTX 2080ti
Inference framework: vllm v0.2.7
Deployment Instructions:

python ./vllm/vllm/entrypoints/openai/api_server.py \
--model './NeuralMarcoro14-7B' \
--tokenizer './NeuralMarcoro14-7B' \
--tokenizer-mode auto \
--dtype float16 \
--enforce-eager \
--tensor-parallel-size 2 \
--trust-remote-code

I set some default parameters:

"use_beam_search": true,
  "temperature": 0.7
  "stop_token_ids": [2],
  "skip_special_tokens": true,
  "add_generation_prompt": true,
  "min_p": 0.05

https://fast.connectwithgpt.com/chat/share?shareId=04k2p7on33osfd8jj8of8nm4
This link is valid until 2024-01-13 00:00 CST, please do not enter personal or confidential information.

I noticed something strange.
NeuralHermes seems to have been trained on the same dataset:
https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B

But NeuralHermes performs well with vLLM.
Maybe something changed in the training script @mlabonne ?

Maybe I need to set <|im_end|> exactly, instead of "stop_token_ids": [2] (aka </s>)? @FlorianJc
I just found this in NeuralMarcoro14-7B/tokenizer_config.json:

"clean_up_tokenization_spaces": false,
  "eos_token":"</s>",
  "legacy": true,

And NeuralHermes-2.5-Mistral-7B/tokenizer_config.json has:

"clean_up_tokenization_spaces": false,
  "eos_token":"<|im_end|>",
  "legacy": true.

Hahaha, maybe that's the answer? I'll try!
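
Before editing anything, a quick way to confirm what each tokenizer actually reports as its EOS token (the local path and Hub id are the ones used elsewhere in this thread):

from transformers import AutoTokenizer

for name in ["./NeuralMarcoro14-7B", "mlabonne/NeuralHermes-2.5-Mistral-7B"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.eos_token, tok.eos_token_id)

# Based on the two tokenizer_config.json excerpts above, I expect:
#   ./NeuralMarcoro14-7B                  -> </s>        2
#   mlabonne/NeuralHermes-2.5-Mistral-7B  -> <|im_end|>  32000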

I changed NeuralMarcoro14-7B/tokenizer_config.json from

"clean_up_tokenization_spaces": false,
  "eos_token":"</s>",
  "legacy": true,

to match NeuralHermes-2.5-Mistral-7B/tokenizer_config.json:

"clean_up_tokenization_spaces": false,
  "eos_token":"<|im_end|>",
  "legacy": true.

But it doesn't work; the model doesn't stop generating until it reaches max_tokens... Sad.

I changed "eos_token":"<|im_end|>", back to "eos_token":"", it seems to be a bit better, but still occasionally outputs garbled text, why does the answerer's gguf formatted model doesn't output so much garble, but the vllm deployed one does?

Sorry, I have no idea. I don't know how vllm handles that. I would also try to copy the entire config from NeuralHermes but if that doesn't work either... :(

After a full night of debugging and tweaking vllm's deployment parameters, I think it's much better! There are no strange special tokens now; it still occasionally outputs garbled text when it hits rare issues, but that's not too much of a problem anymore. I've posted my vllm deployment below for your reference @mlabonne

python ./vllm/vllm/entrypoints/openai/api_server.py \
--model './NeuralMarcoro14-7B' \
--tokenizer './NeuralMarcoro14-7B' \
--tokenizer-mode auto \
--dtype float16 \
--enforce-eager \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port xxxx \
--trust-remote-code \
--disable-log-stats \
--disable-log-requests

Within vllm I set some default parameters:

    default_min_p = 0.05
    default_use_beam_search = True
    default_ignore_eos = False
    default_skip_special_tokens = True

Then for tokenizer_config.json I copied https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B. It's mostly unchanged, but I changed "eos_token" back to "</s>" here:

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32000": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32001": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "trust_remote_code": false,
  "unk_token": "<unk>",
  "use_default_system_prompt": true,
  "use_fast": true
}
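
If it's useful, this is how I sanity-check that this config renders the expected ChatML prompt (local path as in my deployment command; apply_chat_template needs a reasonably recent transformers version):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./NeuralMarcoro14-7B")

messages = [{"role": "user", "content": "There is a cat and a chicken in the box. How many feet do these two animals have?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)         # <|im_start|>user\n ... <|im_end|>\n<|im_start|>assistant\n
print(tok.eos_token)  # </s> with this config, so vLLM stops on token id 2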

I really like this model of yours! You're invited to try it as I deployed it with vllm (the link is still the one above).

Excellent, thank you for providing all these details! It'll be useful as a reference in the future.

Maybe this is something similar to what is described in the second paragraph here?

https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser#training
