Thanks. Doesn't work with GPT4All, though.

#3
by Phil337 - opened

Q4_K_M loaded in GPT4All v2.7.2, but the output is gibberish. Probably just a compatibility issue with the mixture-of-experts architecture.

Prompt: Define three terms from the field of astronomy...

Response: It seems like there's discussing astrobot: Skillustratorially focused on Engineering concepts related toadvanced education focusing more specifically on aerospacefully understanding complex scientifically, biologie focuses primarily on a brief overview of high- biologie focuses on biologie undoubtedly an intricate science fiction writing about the concept of biologie focused on subjects in kennis derivespecialearl Abigure educational institutions and academic researching engineering students with arose in biologie education: arose in biologie...

I have tested this with Llama.cpp (just a quick check to see if it works):

(llamacpp) maziyar@karma:~/quantize/gguf$ llama.cpp/main -m quantized/MixTAO-7Bx2-MoE-Instruct-v7.0.Q2_K.gguf -p "[INST] please repeat back: apple, orange, juice[/INST]" -n 400 -e


sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 1


 [INST] please repeat back: apple, orange, juice[/INST]Apple, Orange, Juice. [end of text]

llama_print_timings:        load time =     447.56 ms
llama_print_timings:      sample time =       5.04 ms /     9 runs   (    0.56 ms per token,  1784.30 tokens per second)
llama_print_timings: prompt eval time =     507.85 ms /    17 tokens (   29.87 ms per token,    33.47 tokens per second)
llama_print_timings:        eval time =    1017.98 ms /     8 runs   (  127.25 ms per token,     7.86 tokens per second)
llama_print_timings:       total time =    1535.65 ms /    25 tokens
Log end

Keep in mind this is the lowest quant (Q2_K), and I just assumed the prompt template is Mixtral's (the original model didn't mention anything about how to prompt it).
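For anyone who wants to reproduce this check outside the CLI, here is a minimal sketch using llama-cpp-python with the same Q2_K file and the same assumed Mixtral-style [INST] template (the path and sampling settings below are just placeholders, not the exact values from the log above):

```python
# Minimal sketch: run the quantized GGUF through llama-cpp-python with the
# assumed Mixtral [INST]...[/INST] template. Path and settings are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="quantized/MixTAO-7Bx2-MoE-Instruct-v7.0.Q2_K.gguf",
    n_ctx=512,  # matches the n_ctx shown in the log above
)

prompt = "[INST] please repeat back: apple, orange, juice [/INST]"
out = llm(prompt, max_tokens=400, stop=["</s>"])
print(out["choices"][0]["text"])
```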

So this only runs well on Colab, but not locally? All I get is garbage. Curious if anyone has gotten this model to run well on KoboldCpp, and what your settings were.

@zhengr Thanks, that output looks perfect. Glad to know it's working. Apps like GPT4All and KoboldCpp are commonly weeks behind in properly supporting new models (e.g. Yi).

I tested KoboldCpp like @SuperSkirv and it's a little more coherent than GPT4All, but still clearly not there yet.

"In the realm of astrophysics, particularly Astronomy, specifically studies celestial bodies, spatial objects,espécially linked to study of studying the universe, astronomy, studying celestinal terrestrial objects, astrology focuses on astronomical, astronomy revolves in..."

There are two easy ways to go with text-generation-webui (https://github.com/oobabooga/text-generation-webui):

  1. Try my DEMO Google Colab notebook in the cloud (https://colab.research.google.com/drive/1y2XmAGrQvVfbgtimTsCBO3tem735q7HZ?usp=sharing); the free T4 is enough to run the model in 4-bit.
  2. Run the text-generation-webui notebook on your local GPU resources.

Just click "Run all cells" --> wait --> click the share link, that's all.
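For the local route, the 4-bit load that fits on a free T4 is the usual transformers + bitsandbytes setup. A rough sketch; the repo id below is an assumption based on the model name in this thread, so adjust it to the actual Hugging Face repo:

```python
# Rough sketch of a 4-bit load with transformers + bitsandbytes.
# The repo id is assumed from the model name in this thread; adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "zhengr/MixTAO-7Bx2-MoE-Instruct-v7.0"  # assumed Hugging Face repo id

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=bnb_config,
    device_map="auto",  # a free Colab T4 is enough in 4-bit
)

inputs = tokenizer("[INST] Define three terms from the field of astronomy. [/INST]",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```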

It does work from my GGUF in LM Studio.

This model is junk with koboldcpp

Usually when a GGUF model works with Llama.cpp and/or a specific app like LM Studio but not others (koboldcpp), it could be one of a few things: the chat template, the tokenizer, the generation parameters, or a different/older version of Llama.cpp.
If you can use LM Studio just to compare some of these, it may help narrow it down.
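One quick way to compare the first two items is to dump the GGUF metadata and see what chat template and tokenizer the file actually ships with. A sketch using the gguf Python package that comes with llama.cpp; the field-reading details here are from memory of gguf-py and may need adjusting for your version:

```python
# Sketch: print the GGUF metadata that matters for front-end compatibility
# (architecture, tokenizer type, embedded chat template). gguf-py API details
# are from memory and may differ slightly between versions.
from gguf import GGUFReader

reader = GGUFReader("MixTAO-7Bx2-MoE-Instruct-v7.0.Q4_K_M.gguf")
for key in ("general.architecture", "tokenizer.ggml.model", "tokenizer.chat_template"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <not present>")
    else:
        # string values live in the part indexed by field.data[0]
        print(f"{key}: {bytes(field.parts[field.data[0]]).decode('utf-8')}")
```

If the chat template key is missing or doesn't match what GPT4All/koboldcpp apply by default, that alone can explain the gibberish.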

It is literally gibberish.

It doesn't even work with text-generation-webui

@ReXommendation I tested several MoEs, not just this one, and was hoping they would eventually achieve coherency. Since they didn't, I did a little research and discovered it's not even theoretically possible.

For an MoE to maintain coherency, the experts need to be trained together with the router. Taking independently created and fine-tuned LLMs and combining them into an MoE may improve some test scores, but coherency will always be broken to some degree as you string together tokens from different LLMs with differing writing styles. Plus you don't get any of the additional benefits, such as greater information density. For example, the MMLU score, or total knowledge, will remain around 65 even if you make an MoE out of 100 Mistrals.

So my recommendation is to just stick to true MoEs like Mixtral and Qwen MoE 2.7b.
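For anyone unfamiliar with how the routing works, here is a toy, untrained sketch of Mixtral-style top-2 gating in PyTorch. It only illustrates the mechanism: the router picks and weights experts per token, so if the experts were fine-tuned separately and the router was never trained with them, the mixing is essentially arbitrary, which is the coherency problem described above. All names and sizes are made up.

```python
# Toy top-2 MoE layer (illustration only). The router's per-token weights are
# what has to be learned jointly with the experts for the mix to be coherent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, hidden: int, num_experts: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (tokens, hidden)
        weights, idx = self.router(x).topk(2, dim=-1)
        weights = F.softmax(weights, dim=-1)      # per-token mixing weights
        out = torch.zeros_like(x)
        for slot in range(2):                     # the two routing slots
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopTwoMoE(hidden=64)
print(moe(torch.randn(5, 64)).shape)              # torch.Size([5, 64])
```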

Phil337 changed discussion status to closed

Thanks @Phil337 for your valuable input, I really love reading your reviews. That's why I started fine-tuning the MoE models after I merged them, to get better quality afterwards, like Qwen1.5-8x7b-v0.1. I will try to find some GPUs so I can fine-tune them with a longer sequence length like 16k.

@MaziyarPanahi Thanks for all your uploads. I still have this one in the hope that a future version of GPT4All supports it properly and doesn't produce nonsensical responses.

I didn't even know Qwen1.5-8x7b existed. It's pretty cool that people are taking base models and doing additional training to turn them into functional MoEs. From what I read from Alibaba themselves, that's how they made the Qwen 2.7b MoE: they started with fully trained smaller models, combined them into an MoE, and then turned them into an organic MoE with less compute than if they had started from scratch.
