GPTQ model is available

#1
by MaziyarPanahi - opened

Congratulations! This model is now ranked #1!!!

I went ahead and made a GPTQ quantization for this and I'm happy to share it: MoMo-72B-lora-1.8.7-DPO-GPTQ
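
In case it helps, loading the quant should look roughly like the sketch below (assumes transformers with optimum and auto-gptq installed; the repo namespace is my assumption, adjust it to the actual upload):

```python
# Minimal sketch: load the GPTQ quant with transformers (requires optimum + auto-gptq).
# The repo id below is an assumption; point it at the actual GPTQ upload.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaziyarPanahi/MoMo-72B-lora-1.8.7-DPO-GPTQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread the quantized weights across available GPUs
)

prompt = 'Create 10 sentences that end with the word "apple".'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```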

I ran my private tests for understanding, reasoning, and common sense on that LLM, and it feels like talking to a fine-tuned, very old LLaMA 65B ... poor results.
For instance, Mistral Instruct 0.2 seems much more advanced in understanding, reasoning, and common sense. I'm not even mentioning Mixtral 8x7B, which is on a totally different level... leaps ahead.

I suspect this model is contaminated, and that is why it ranks so high on the leaderboard.

@mirek190 the HF staff and the community are very active in flagging contaminated models, especially in the top 20. Do you happen to have some examples where Mixtral does better?

In every one of my tests Mixtral is better, including coding.
For instance, try:

Create 10 sentences that end with the word "apple".
Or
Provide complete working code for a realistic-looking tree in Python, using the turtle library and a recursive algorithm.
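
For reference, a correct answer to the turtle prompt should look roughly like this minimal sketch (my own reference version, not any model's output):

```python
# Minimal recursive "fractal" tree with the standard-library turtle module.
import turtle

def draw_branch(t, length, depth):
    """Draw one branch, then recurse into two smaller sub-branches."""
    if depth == 0 or length < 5:
        return
    t.pensize(max(1, depth))                   # thicker trunk, thinner twigs
    t.forward(length)
    t.left(25)
    draw_branch(t, length * 0.75, depth - 1)   # left sub-branch
    t.right(50)
    draw_branch(t, length * 0.75, depth - 1)   # right sub-branch
    t.left(25)                                 # restore the original heading
    t.backward(length)                         # walk back to the branch base

if __name__ == "__main__":
    t = turtle.Turtle()
    t.speed(0)
    t.left(90)                                 # point the turtle upward
    t.penup(); t.backward(200); t.pendown()    # start near the bottom of the window
    draw_branch(t, 100, 8)
    turtle.done()
```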

The MoMo model fails those prompts and most of my other ~20 questions, whereas Mixtral 8x7B Instruct answers all of them properly and Mistral 7B 0.2 Instruct answers more than half of them (~15).

That's why I claim the model is contaminated.
I do not believe a model with such a high score could show this lack of reasoning and common sense.
Its answers are very similar to the old LLaMA 65B models I was testing 7 months ago.

Actually, right now no open-source model is better than Mixtral 8x7B Instruct as a general-knowledge LLM. I have tested most of them and didn't find any model able to answer even those two questions properly, except Mixtral...
Of course, coding models will answer the Python tree question, but only newer ones like WizardCoder 34B 0.2.

You can find my nickname, mirek190, with tests of those models on Hugging Face. I posted only about the models that, after testing, were worth saying something more about on the internet.

Moreh, Inc. org

Hi, we haven't trained our model on any datasets other than the three mentioned in our model card:

  1. Open-Orca/SlimOrca
  2. jondurbin/truthy-dpo-v0.1
  3. Intel/orca_dpo_pairs

and to the best of our knowledge, these three are not contaminated.

Additionally, we have tested for contamination following [https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/472]:
gsm8k: result < 0.1, %: 0.47
truthfulqa: result < 0.1, %: 0.44

Contamination test results for other tasks will be updated soon.
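
For anyone who wants to run a rough sanity check of their own, a simple n-gram overlap test could look like the sketch below (this is not the exact method from the linked discussion; the dataset schemas and the 13-gram size are assumptions):

```python
# Rough sketch of an n-gram overlap contamination check, NOT the method from
# the linked leaderboard discussion. Dataset names are from the model card;
# the SlimOrca ShareGPT-style schema and the 13-gram size are assumptions.
from datasets import load_dataset

def ngrams(text, n=13):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Index n-grams from a subsample of one training set.
train = load_dataset("Open-Orca/SlimOrca", split="train")
train_ngrams = set()
for row in train.select(range(10_000)):        # subsample for speed
    for msg in row["conversations"]:           # assumed ShareGPT-style field
        train_ngrams |= ngrams(msg["value"])

# Compare against a benchmark test split (GSM8K here).
test = load_dataset("gsm8k", "main", split="test")
overlapping = sum(1 for q in test["question"] if ngrams(q) & train_ngrams)
print(f"{overlapping}/{len(test)} test questions share a 13-gram with the sample")
```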

Thanks for providing some examples. Just out of curiosity, do you have a specific system prompt or chat template? In HuggingChat, Mixtral 8x7B Instruct is not able to answer the first question correctly:

[Screenshot: HuggingChat, Mixtral 8x7B Instruct failing the "apple" sentences question]
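
For reference, a minimal sketch of formatting the question with Mixtral's built-in chat template (no system prompt), using the standard mistralai tokenizer via transformers:

```python
# Sketch: format the test question with Mixtral 8x7B Instruct's chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": 'Create 10 sentences that end with the word "apple".'},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # -> "<s>[INST] Create 10 sentences ... [/INST]"
```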

Check here

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/discussions/1

Try those questions and you will find that MoMo's answers are worse, despite its much higher place on the leaderboard.
If the model is not contaminated, then the testing procedure is very unreliable right now.

Or you can just talk with the model about anything; after a few minutes of conversation it is clear how smart and intelligent a model is... As I said before, MoMo's reasoning and intelligence feel to me like a fine-tuned first-generation LLaMA 65B.
