Thanks for this!
I'm running ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf (it's the largest quant I can fit on a 24GB card without also having to occupy my secondary card), and it's running great!
I had previously run a task on TheBloke_Wizard-Vicuna-30B-Uncensored-GPTQ. I had to modify the prompt a good bit, as that model tends to want to "elaborate on its input using its knowledge of the topic" when I wanted it to act only on what the input text said. There were a couple of other issues too, like it tending to focus only on the end of the input text, but I was largely able to fix them via prompt adjustments. Now it's producing nice, neatly formatted, consistent responses, and overall the output is better than what I got from TheBloke_Wizard-Vicuna-30B-Uncensored-GPTQ. Very nice!
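In case anyone hits the same behavior, here's roughly the shape of the prompt adjustment that worked for me (a minimal sketch using llama-cpp-python; the model path, context size, and exact instruction wording are placeholders for illustration, not my exact setup):

```python
from llama_cpp import Llama

# Load the GGUF quant (path is a placeholder).
llm = Llama(
    model_path="ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers to the 24GB card
)

input_text = "..."  # the text the model should act on

# Explicitly forbid outside knowledge and force attention to the whole input,
# which is what curbed the "elaborating" and end-of-text-only habits for me.
prompt = (
    "Below is some input text. Respond using ONLY the information contained "
    "in the input text itself. Do not add facts from your own knowledge, "
    "and consider the ENTIRE text, not just the final paragraph.\n\n"
    f"### Input:\n{input_text}\n\n### Response:\n"
)

out = llm(prompt, max_tokens=512, temperature=0.2)
print(out["choices"][0]["text"])
```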
BTW, if you ever see fit to do any "lightweight" models (e.g. just a couple billion parameters), that would be appreciated. :) They're nice to use as a base for training new, highly specialized models. You could even just take a heavyweight model, strip out some of the deeper layers, and then retrain for a bit (rough sketch below) - I've seen this done with a (non-open) model and it worked well.
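For illustration, the layer-stripping idea looks something like this (a minimal sketch with Hugging Face transformers; the checkpoint name and the choice of which/how many blocks to keep are assumptions on my part, not what that non-open model actually did):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any decoder-only HF model with a block list works similarly.
name = "tiiuae/falcon-7b"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(name)

# Falcon exposes its decoder blocks at model.transformer.h (a ModuleList).
layers = model.transformer.h
keep = 16  # keep the first 16 of 32 blocks -- an arbitrary illustrative choice

# Drop the deeper blocks and update the config so generation still works.
model.transformer.h = torch.nn.ModuleList(list(layers)[:keep])
model.config.num_hidden_layers = keep

# The truncated model then needs a short retraining pass to "heal";
# from here you'd hand it to your usual finetuning setup.
model.save_pretrained("falcon-7b-pruned-16L")
tokenizer.save_pretrained("falcon-7b-pruned-16L")
```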
I asked for support of the Falcon 1B and 7B RW variants in llama.cpp, but they seem to be too different from the standard Falcon architecture to be easily converted to GGUF right now. I know there are other "small" models and will have a look soon.
:)