Leaderboard Issues

#594
by Kukedlc - opened

Hello huggers, yesterday I uploaded a model and it was first among the 7B-parameter models. I changed some things in the model card, and now I can't find it anymore (I have screenshots). On the other hand, looking at the metrics of other models I uploaded, they are not so bad, yet they never appear on the leaderboard. Why? And finally, the model that was first had issues with the prompt template and still scored high. Why? Thanks for such a beautiful community of embracing faces!
(screenshot: mi_modelo.png)

@Kukedlc They can probably help you better if you provide links to the models you're referring to.

I personally found the following model on the leaderboard when I unchecked the option that hides merges. Since it's a merge, as the model card specifies, that's how it's supposed to be.

https://huggingface.co/Kukedlc/NeuTrixOmniBe-DPO

Open LLM Leaderboard org

Hi!
@Kukedlc Please follow the steps in the FAQ and give us the name of your model as well as the links to its requests and results files. It's likely a tagging issue, though: can you check whether the models appear when you allow all models to be shown, as suggested by @Phil337 ?
(Thanks @Phil337 for this answer btw! :) )

Hello @clefourrier , sorry for the delay. Hi @Phil337 , thanks for the response. Yes, I was able to see my model, but I ran into a bigger issue. The merge I did between CultriX's model and another one from PaulML ranked number one, but my concern is that CultriX's model (which I used in the merge with mergekit) often fails during inference: it returns responses full of INSTINSTINST tokens. I tried different temperature, top_p and top_k values, as well as repetition_penalty (the hyperparameter most relevant to this error), but it still doesn't work properly. That's why I kept it private and couldn't see it; thanks to Phil's suggestion, I was able to find my model again. My big question: how can flawed models, both mine and CultriX's, as well as today's top 3 (merges of my model and others), be at the top of the leaderboard? It might just be my ignorance, but this seemed concerning to me. In the discussion of CultriX's model, many people reported the same error. Thanks Clem and thanks to HF for so many joys!

@clefourrier The models with the bug are:
CultriX/NeuralTrix-7B-dpo
Kukedlc/NeuTrixOmniBe-7B-model-remix
Kukedlc/NeuTrixOmniBe-DPO
Another strange thing: no matter whether you choose float16 or bfloat16 when submitting, the evaluation comes out the same, which I don't think should happen. I think there's a mix-up between those data types.
(screenshot: image.png)
All of those models, except for mlabonne's, have the same bug. They are merges of CultriX's model or of my models (which are themselves merges of CultriX's models), so they inherit the bug, and they all use my model to create models with better scores... but defective ones. I think that's a problem, isn't it?

Open LLM Leaderboard org

@Kukedlc
Most of the evaluations we use in the leaderboard actually do not need inference in the usual sense: we evaluate the ability of models to select the correct answer from a list of preset choices, which does not test generation abilities (but rather things like language understanding and world knowledge).
However, GSM8K is a generative evaluation, and you can see what these models actually generated by looking at the details files. If the score is good, the generation was too, and you can check this for yourself.
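To make this concrete, here is a rough sketch of what that kind of log-likelihood choice selection looks like (the stand-in model, the toy question, and the scoring helper below are illustrative, not the harness's actual code or prompts):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in model; the leaderboard runs this kind of scoring on the submitted model.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " Berlin", " Madrid", " Rome"]

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..N-1
    targets = full_ids[0, 1:]
    # Only score the tokens belonging to the choice, not the shared context
    # (assumes the context tokenizes identically with and without the continuation).
    idx = torch.arange(ctx_len - 1, targets.shape[0])
    return log_probs[idx, targets[idx]].sum().item()

scores = {c: choice_logprob(question, c) for c in choices}
print(max(scores, key=scores.get))  # the option the model "selects"

The model is never asked to generate free text here, which is why broken generation behaviour does not necessarily hurt these scores.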

The problem you mention (the generated INST INST tokens) feels like an issue with the way you are applying the chat template. Maybe @CultriX has some insight on this?

Lastly, regarding the bfloat16 and float16 problem, I can investigate it if you link me to the associated results files for your models (following the FAQ, in the about tab of the leaderboard).

OK this is really interesting.
With my latest model (CultriX/NeuralTrix-bf16) I got the same thing (INSTINSTINSTINST), but when I loaded the model into LM Studio using the following template, the issue seems to have disappeared completely and I get great results now:

{
  "name": "Default LM Studio Windows",
  "load_params": {
    "n_ctx": 2048,
    "n_batch": 512,
    "rope_freq_base": 10000,
    "rope_freq_scale": 1,
    "n_gpu_layers": 0,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true
  },
  "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    "input_prefix": "### Instruction:\n",
    "input_suffix": "\n### Response:\n",
    "antiprompt": ["### Instruction:"],
    "pre_prompt": "#",
    "pre_prompt_suffix": "\n",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}

Example with ChatML:
Prompt: >>> "How can I use awk to only select lines longer than 30 characters?" <<<
Result: >>> INSTINSTINSTINST: As awk operates primarily in Linux environments, your query doesn't align with my Windows 11 Powershell focus. For awk, target UNIX-like systems and use: awk '$0 answ: length($0) > 30 {print}' filename to select lines longer than 30 characters. <|INSTINSTINSTINST... [INST tokens repeat until the output is cut off] <<<

Example using the template above:
Prompt: >>> "How can I use awk to only select lines longer than 30 characters?" <<<
Result: >>> Awk doesn't natively operate in Windows environments like PowerShell does for Windows systems. To achieve a similar result with PowerShell, you can filter lines based on character length:

Get-Content filename.txt | Where Length $_ -gt 30

This command reads the file, filters lines longer than 30 characters, and outputs them. Replace filename.txt with your actual file path. <<<
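For anyone trying to reproduce this outside LM Studio, here is a minimal transformers sketch that applies the same "### Instruction / ### Response" format (the sampling values mirror the config above; the model id and max_new_tokens are just placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CultriX/NeuralTrix-bf16"  # placeholder: point this at whichever merge you are testing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Same Alpaca-style prefix/suffix as the LM Studio template above
prompt = (
    "### Instruction:\n"
    "How can I use awk to only select lines longer than 30 characters?\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,      # placeholder length
    do_sample=True,
    temperature=0.8,         # values taken from the config above
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
# Print only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))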

I have no idea if this is of any use to you, but here are a bunch of benchmark results you could reference to troubleshoot this issue more closely:
https://gist.github.com/CultriX-Github/f494886f09815cda0b43a80701bd379d

Open LLM Leaderboard org

Hi!
@Kukedlc regarding the float16 vs bfloat16 results: if you compare the two files in your results folder, you'll notice that the results are indeed different (the difference is only visible at a higher precision than what we display on the leaderboard), so there was no problem there.
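If you want to double-check this at full precision, here is a rough sketch; the results repo id and file layout are assumptions about how the leaderboard publishes results files, so adjust the paths to whatever the FAQ points you to:

import json
from huggingface_hub import HfApi, hf_hub_download

# Assumed repo id and per-model directory layout; adjust to the actual results files linked from the FAQ.
repo = "open-llm-leaderboard/results"
model_dir = "Kukedlc/NeuTrixOmniBe-DPO"

api = HfApi()
files = [f for f in api.list_repo_files(repo, repo_type="dataset") if f.startswith(model_dir)]
print(files)  # typically one results_*.json per run

for f in files:
    path = hf_hub_download(repo_id=repo, filename=f, repo_type="dataset")
    with open(path) as fh:
        results = json.load(fh)
    # Print the raw, un-rounded scores so small float16/bfloat16 differences become visible
    # ("results" is the top-level key used by harness result files; adjust if needed)
    print(f, results.get("results", {}))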
@CultriX 's answers highlight that there is no intrinsic problem with the models; the issues you've encountered come from the way you load them, which is independent of the leaderboard. It could also indicate that these models need stronger constraining through the prompt, since we run the generative evals in a 5-shot setting.

Side note: loading the details of your model shows that it outputs the expected format (without INSTINST...) more than 95% of the time.
You can check it for yourself by running:

import datasets

# Load the GSM8K generations from the model's details dataset
ds = datasets.load_dataset(
    "open-llm-leaderboard/details_Kukedlc__NeuTrixOmniBe-7B-model-remix",
    "harness_gsm8k_5"
)["latest"]

# Count how many generations contain the spurious INST tokens
pb_count = 0
for row in ds:
    if "INST" in row["predictions"][0]:
        pb_count += 1
    print(row["predictions"][0], "\n----")

print(pb_count / len(ds))  # fraction of problematic generations

Given these two points, it feels like one of your questions has been answered and the other is outside the scope of the leaderboard.

clefourrier changed discussion status to closed
