The model is not performing well.

#1
by Liangmingxin - opened

I deployed the model using vLLM v0.3.0 and tried both the default chat template and the chatml.jinja template; neither resulted in a satisfactory experience.

python /home/myuser/vllm/vllm/entrypoints/openai/api_server.py \
--model './NeuralTrix-7B-dpo' \
--tokenizer './NeuralTrix-7B-dpo' \
--tokenizer-mode auto \
--dtype float16 \
--max-model-len 8192 \
--enforce-eager \
--tensor-parallel-size 2 \
--worker-use-ray \
--engine-use-ray

# --chat-template 'vllm/examples/template_chatml.jinja' \
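A quick way to check what a given chat template actually produces is to render it locally before pointing vLLM at it. A minimal sketch (my own, assuming the tokenizer files live in ./NeuralTrix-7B-dpo):

from transformers import AutoTokenizer

# Load the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained("./NeuralTrix-7B-dpo")
messages = [{"role": "user", "content": "What is the capital of France?"}]

# tokenize=False returns the raw prompt string the template builds
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))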


Why does it score so well on the leaderboards when the actual deployment is terrible?

I replaced it with other chat templates and the output is still bad. What is the reason?

--chat-template 'vllm/examples/template_alpaca.jinja' \


Running it like this works for me (where C is the command to start the model, L is the log output from the model, and I is the inference script I ran):

Command to run (C):

python -m vllm.entrypoints.openai.api_server --model CultriX/NeuralTrix-7B-dpo --gpu-memory-utilization=1 --chat-template vllm/examples/template_chatml.jinja --tokenizer-mode auto --dtype float16 --max-model-len 8192 --enforce-eager --tensor-parallel-size 1 --worker-use-ray

MODEL LOG OUTPUT (L):

INFO 02-11 08:02:31 async_llm_engine.py:431] Received request cmpl-c27fac4e598f45f08fb1326ae4a87d21-0: prompt: None, prefix_pos: None, sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1824, 349, 272, 16809, 328, 302, 4843], lora_request: None.
INFO 02-11 08:02:31 async_llm_engine.py:110] Finished request cmpl-c27fac4e598f45f08fb1326ae4a87d21-0.
INFO: 127.0.0.1:49984 - "POST /v1/completions HTTP/1.1" 200 OK

Inference script (I):

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(
    model="CultriX/NeuralTrix-7B-dpo",
    prompt="What is the capitol of France",
)
print(completion.choices[0].text)


I also got the same problem when I ran the GGUF version in LM Studio. I tried various chat templates, but it keeps repeating INSTINSTINST.

I'm having this same issue. It's performing the worst on my own benchmark, on which I've also tested neuralbeagle14-7b.Q5_K_M.gguf and mistral-7b-instruct-v0.2.Q5_K_M.gguf. Horrible model.

same issues

I loaded it in oobabooga's text-generation-webui, and it constantly spits out INSTs.

Guys, guys, I'm lasering it. Let me try this first, then I'll see if something else might help it.

EDIT


It passes some of my most difficult RP. I don't see the issue even a little bit. Laser should theoretically help.

Hi @Kquant03
I just wanted to say thank you for trying to fix the flaws in the model, as I feel there is indeed potential in it, but I can't seem to resolve the issue myself.
Thanks for the support man!

I don't know if it helps, but I suspect the problem was caused somewhere along the chain of merges that make up this model: bf16 and fp16 may have been used inconsistently, with fp16 applied to bf16 models, causing parts of the model to break. This is still quite complex for me to fully grasp (again: I'm not a data scientist, just a hobbyist), but I figured I'd mention it anyway, as it might help more knowledgeable people like yourself diagnose and fix the issue!

Thanks again!

Edit: Could you, in a few sentences, explain to me what it means to "laser" a model?
Good luck with it, I hope it improves the model!

Edit 2: I have not had the time to test it myself yet, but @eren23 made eren23/dpo-binarized-NeuralTrix-7B.
It might resolve the issues, but as I said, I have not tested it yet. If it does, it might be a better model to try and laser?


Lasering a model basically means removing the noise in the layers, like how the brain prunes information while sleeping. Feel free to ask more questions or you can even contact me here
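For intuition, the core operation behind laser-style methods such as laserRMT is replacing a weight matrix with a low-rank SVD approximation, so the small singular values treated as noise get dropped. A toy sketch of that idea (not the actual repo code):

import torch

def low_rank_approximation(weight: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Full SVD of a 2-D weight matrix
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    k = max(1, int(S.numel() * keep_ratio))  # number of singular values to keep
    # Rebuild the matrix from the dominant components only
    return (U[:, :k] * S[:k]) @ Vh[:k, :]

W = torch.randn(4096, 4096)  # stand-in for e.g. an MLP weight
W_denoised = low_rank_approximation(W, keep_ratio=0.5)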

Hello @CultriX , hello @Kquant03 . I tried the binarized model from @eren23 , and it has the same error. It's great that @Kquant03 may be able to resolve it. I made my model private to stop spreading defective models. I tried many ways to fix the bug and only managed to do it with a combination of PEFT and LoRA training, attaching the adapter to the model.

On another note, something @CultriX mentioned, which is very true: there is a discrepancy when evaluating the model on the leaderboard. If you choose fp16 or bf16, it evaluates the same, which shouldn't happen. I think that's the specific problem. I also thought about merges of instruct models with chat models. Those are my hypotheses.

Thank you very much, @CultriX , for all your contributions; they are very helpful and drive progress! And @Kquant03 , I would be interested if you could share the notebook for applying the laser process, or share some libraries. I find it super interesting and haven't been able to make it work yet; it would be a great help to me. Thanks, guys, count me in for testing, converting models to GGUF, or whatever I can do to help!


https://github.com/cognitivecomputations/laserRMT

Here's the link to the GitHub repo for laser; there's an ipynb file in the examples.

OK, this is really interesting.
With my latest model (CultriX/NeuralTrix-bf16) I got the same thing (INSTINSTINSTINST), but when I loaded the model into LM Studio using the following template, the issue seems to have disappeared completely and I get great results now:

{
  "name": "Default LM Studio Windows",
  "load_params": {
    "n_ctx": 2048,
    "n_batch": 512,
    "rope_freq_base": 10000,
    "rope_freq_scale": 1,
    "n_gpu_layers": 0,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true
  },
  "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    "input_prefix": "### Instruction:\n",
    "input_suffix": "\n### Response:\n",
    "antiprompt": ["### Instruction:"],
    "pre_prompt": "#",
    "pre_prompt_suffix": "\n",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}
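For what it's worth, the fields doing the real work above are input_prefix and input_suffix: they build an Alpaca-style prompt with no [INST] markers at all. Roughly what gets sent to the model, reconstructed as a sketch (the exact assembly order is my assumption, not verified against LM Studio):

def build_prompt(user_message: str) -> str:
    # Values copied from the config above
    pre_prompt = "#"
    pre_prompt_suffix = "\n"
    input_prefix = "### Instruction:\n"
    input_suffix = "\n### Response:\n"
    return pre_prompt + pre_prompt_suffix + input_prefix + user_message + input_suffix

print(build_prompt("How can I use awk to only select lines longer than 30 characters?"))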

Example with ChatML:
Prompt: >>> "How can I use awk to only select lines loner than 30 characters?" <<<
Result: >>> INSTINSTINSTINST: As awk operates primarily in Linux environments, your query doesn't align with my Windows 11
Powershell focus. Forawk, target UNIX-like systems and use:awk '$0 answ: length($0) > 30 {print}' filename to select
lines longer than 30 characters. <|INSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTIN
STINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTIN
STINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTIN
STINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINST <<<

Example using the template above:
Prompt: >>> "How can I use awk to only select lines loner than 30 characters?" <<<
Result: >>> Awk doesn't natively operate in Windows environments like PowerShell does for Windows systems. To achieve a similar
result with PowerShell, you can filter lines based on character length:

Get-Content filename.txt | Where Length $_ -gt 30

This command reads the file, filters lines longer than 30 characters, and outputs them. Replace filename.txt with your
actual file path. <<<

Great that you're sharing this, @CultriX . I suspected something with the prompt template from the beginning, and seeing this, I'm pretty sure that's it. In another post, people from HF told me that the model works fine (my merge that included your model), so I thought that if it works fine for them, the error was mine. I'm not an expert in LLMs; I'm learning. It's good that it can be used properly! I'll try this: "input_prefix": "### Instruction:\n" and "input_suffix": "\n### Response:\n".

OK, while my last comment improved things, it did not fix the issue.
What did seem to fix it, however, is changing the settings when quantizing:

Changed:
"""
Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

for method in QUANTIZATION_METHODS:
qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
!./llama.cpp/quantize {fp16} {qtype} {method}
"""

To:
"""
f32 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.f32.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f32 --outfile {f32}

for method in QUANTIZATION_METHODS:
qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
!./llama.cpp/quantize {f32} {qtype} {method}
"""

Test Prompt: >>> Discuss how Napoleons height might have contributed to his desire to rule.
Answer: >>> The premise presented here involves a hypothetical connection between Napoleon Bonaparte's physical stature and his drive for power and ruling. In reality, historical interpretations focus more on the complex political, social, and personal factors influencing Napoleon than his height alone. Nonetheless, exploring this concept imaginatively may provide some insights into the broader concept of perception and leadership.

If we assume Napoleon felt self-conscious about his relatively short stature (standing at around 5'2" during the Napoleonic era when the average European male might have been taller), one could speculate that it fueled an inner resolve to prove himself through exceptional ability, strategic genius, and a determination to shape the world on his terms. His military successes and political accomplishments might be seen as an overcompensation for any perceived physical inadequacy.

Napoleon's meteoric rise to power demonstrated that leadership is not solely defined by one's physical attributes. Instead, it underscores the importance of character, vision, determination, and the ability to inspire and motivate others. While Napoleon's height could have been a personal motivator in some intangible way, it was his strategic brilliance, political acumen, and charisma that shaped his legacy and his rule over vast European territories.

This hypothetical reflection ultimately underscores the significance of disentangling myth from fact when considering historical figures like Napoleon, while appreciating the multifaceted nature of leadership and its often deeply personal journey for those who seek it.

@CultriX How did you realize that changing from 16 to 32 solved the issue? Great Job!!!!


Wow great finding! Let's keep improving our merges then πŸ”₯

@CultriX Is there going to be a new version with the INSTINST issue fixed?

Owner

Hi, I figured I'd give another update on this:

After looking into it some more, I found something quite shocking when using the DPO fine-tuning script by @mlabonne (although I may have slightly modified it for that run, which may have caused the error; I can't remember exactly, unfortunately). For some reason, the way the script handled the input training data and formatted it for the ChatML template introduced a small word into the "chosen" answer every so often. You guessed it: that word was "INST".

Therefore, the model learned that many of the preferred answers had the word INST in them, which is probably why it is so happy to spam that word every now and then.

Sorry for not noticing this earlier; I am pretty sure this is the root cause of all of this.
@kquant03 @Kukedlc
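A cheap guard against this kind of leakage (a hypothetical helper of mine, not code from the original run) is to scan the formatted pairs for stray template fragments before training:

def find_template_leakage(rows, fields=("prompt", "chosen", "rejected"), needle="INST"):
    # Report every row where a template fragment leaked into the formatted text
    for i, row in enumerate(rows):
        for field in fields:
            if needle in row.get(field, ""):
                print(f"row {i}: {needle!r} found in {field!r}: {row[field][:80]!r}")

# e.g. run it on the dataset produced by the chat-template mapping step:
find_template_leakage([{"prompt": "[INST] hi [/INST]", "chosen": "Hello!", "rejected": "No."}])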

Great, @CultriX , this seems to be the real cause of the problem. I created a training routine that penalizes the INST token, but it didn't seem to me to be the best solution. As for the DPO code, I didn't quite understand the part about forming the chat, but what you're saying, CultriX, makes a lot of sense. Thank you very much for commenting on it.

Since we're here, I propose creating a GitHub repository where we can upload our code for training, merging, DPO, laser, etc., and share knowledge. Personally, I can share a lot of code optimized to use the free GPU tiers of Kaggle and Google!

@CultriX great find! BTW, which of his fine-tuning scripts contains the problem you mentioned? For some of my fine-tuning runs, I might have used it too :D

@Kukedlc great idea!! I would also contribute, which would allow me to be more organized with my scripts as well; currently, they are literally all over the place :D We can discuss it on your Discord if that makes sense.


Good job @CultriX ! Is this the script you are referring to? https://github.com/mlabonne/llm-course/blob/main/Fine_tune_a_Mistral_7b_model_with_DPO.ipynb

I can't wait to give the fixed version a try!! The TruthfulQA score is outstanding!

I'll include an example of this happening here (notice the INSTs in the output!):

### THE CODE USED ###
import os
import gc
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
import wandb

# Defined in the secrets tab in Google Colab
hf_token = "<huggingfacetoken>"
wb_token = "<wandbtoken>"
wandb.login(key=wb_token)

model_name = "CultriX/NeuralTrix-7B-v1"
new_model = "CultriX/Wernicke-7B-dpo"
def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['prompt']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)
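    # NOTE: if the tokenizer defines no chat_template, this call silently falls
    # back to the default Llama [INST] format (see the warning in the output
    # below), which is how the [INST] markers enter the training prompts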

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }


# Load dataset
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

# Print sample
dataset[1]

### OUTPUT OF THE CODE ###

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set tokenizer.chat_template to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

{'prompt': ' [INST] Do you possess the ability to navigate or move within a physical environment? [/INST]',
'chosen': 'No, I do not possess the ability to navigate or move within a physical environment. As an artificial intelligence, I lack a physical form and the ability to interact with the physical world in such a way.<|im_end|>\n',
'rejected': 'Yes, I can navigate and move within a physical environment using sensors and motors to interact with the surroundings.<|im_end|>\n'}
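One way to avoid that fallback (a sketch, assuming the model actually expects ChatML, which the merge recipes suggest) is to set the template explicitly before the map step. The template string below is the generic ChatML definition, not taken from this model's repo:

# Generic ChatML template; assumed, not read from the model repo
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
# Re-running chatml_format via dataset.map should then yield <|im_start|> markers
# instead of [INST] in the 'prompt' field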

Owner

OK, here the INST is injected into the prompt rather than the answer, but I've noticed that as well :)

Owner


Yes, that's the one! Although, as I said, I might have made some slight modifications; for example, I used another dataset.

Sign up or log in to comment