Great to... Awesome

#1 by Cpt-Turtle - opened

I've been trying your models for RP since loyal-macaroni-maid, which I found great.

Then I tried silicon-maid, sonya and lelantos, which did not compel me to switch from loyal-macaroni despite the higher benchmarks of the newer models.

Kunoichi, though, is really a level higher. It maintains the same level of creativity while being more clever (respecting new rules on the fly) and, above all, it follows instructions more closely, which is tremendously useful.

It does not feel like a 7B model. It feels way above.

Awesome job. Keep up the good work.

PS: It would be interesting to see an offspring at 13B and 20B. Or maybe an MoE... (even if the various MoEs I have tried have hardly convinced me)

Thanks for trying my models out! I agree - my hope is that this model is a noticeable step up from my previous ones. I'm looking forward to seeing more feedback from others who give it a shot!

Most likely I'll keep messing with 7Bs. I started making these 7B models because I had a hypothesis that 7Bs were more capable than 13/20Bs. Once Mixtral finetunes get figured out, I might try my hand at those, though :)

Oh, it certainly is - I've never encountered such a smart cookie in a 7B package that is so universally usable - it's really nice ^-^' - it can roleplay nicely, but it also answers rather quickly and is on point with the first response. It lacks a bit in regen capabilities (it needs more time to iterate on its own mistakes - but it'll eventually catch them), but since the replies are so fast, that hardly matters to me.

The safety bias is active, yesh yesh, but it's such a step up from the other models I've personally tried (like Cpt-Turtle above - I've seen similar ones, so I won't list them), even compared to the 13B ones I have used - it's really swell =) Especially since people with additional VRAM headroom can now increase context and activate the cfg-cache, for example, or, y'know, USE the PC for other tasks such as ST Extras and a smaller, lighter Stable Diffusion instance that also shares the VRAM budget. Such a smart 7B model is clearly a sweet spot - 8GB VRAM users shouldn't be left out =)

But yes, any variation on this one seems a fantastic prospect! Thank you for putting this one out here - it has absolutely enriched my experience with RP models - certainly!

I haven't touched 7b models for a long time, not until I came across your Silicon-Maid model. I was surprised at how good that model is and was using it for RP for a week.

Then I decided to try Kunoichi. I have to say that Kunoichi is one step above Silicon-Maid.

I did see a noticeable difference: the logic is slightly improved while it remains good at roleplaying and following the character card. All of which is important in a model for RP. I keep forgetting how small this model is, since it feels bigger. I've used both 13b and 20b models. Though I wouldn't go as far as to say Kunoichi is as good as 20b models (since I feel like 20b does a better job at storywriting), I'd say the quality lands somewhere between 13b and 20b. Not bad for a little guy! Especially since there is only so much you can do with a small model.

I am hoping that having these big improvements in 7b models will encourage 13b models to make big improvements as well!

"Most likely I'll keep messing with 7Bs"

Some people out there respect this a lot, for sure! Even when it may not be obvious. On Discord and even on Reddit, some of your models were recommended as an insider's tip. That's how I stumbled upon yours in the end.

Which Loader do I use for this model? Sorry for being such a newbie..πŸ˜…

Ahh, nonsense. Everybody starts somewhere! It would help if you told us which software you want to use for this.
If you're using the OobaBooga Web UI, for example, I'd suggest using ExLlama2 or its _HF variant. They are actively maintained and have replaced the old exllama kernel. A perfect fit for GPU acceleration, in my experience.

Have a gr8 Sunday, Zuzus!

@xpgx1 I'm using OobaBooga. I tried loading with ExLlama2/_HF but it gives me an error, and I tested with llama.cpp but still got an error. It only loads with Transformers, but that is so slow for me.
What's the minimum VRAM Kunoichi-7B needs?

A brief primer on all those fancy words:

In general, we "end-users", who are just trying to run some inference for RP and ERP purposes =), use the so-called "quantized" variants of these LLMs. Why? They simply save space (at the cost of some accuracy - but that's to be expected - you can't have it all).

THIS model here, "SanjiWatsuki/Kunoichi-7B", represents the "unquantized", floating point 16 (fp16) variant. The "original", so to speak. Sanji fine-tuned this model and provided it here for us to use. When you select the "Files and versions" tab at the top of the page, you can see how large these files are. In our case, we're dealing with 9.86 GB + 4.62 GB = 14.48 GB of VRAM usage (roughly - it sometimes depends on the loader and what else is stored in the fastest RAM our GPUs can access, the VRAM).
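
If you want to sanity-check those numbers yourself, here's a rough back-of-the-envelope sketch - the ~7.24B parameter count of Mistral-7B-class models is an assumption, and real usage adds overhead for context / KV cache on top of the weights:

```python
# Rough VRAM estimate: parameter count x bytes per weight.
# This only counts the weights - context/KV cache and loader overhead come on top.
params = 7.24e9  # approximate parameter count of a Mistral-7B-class model (assumption)

for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>5}: ~{gb:.1f} GB for the weights alone")
```

That works out to roughly 14.5 GB for fp16 and about 3.6 GB for a 4-bit quant, which lines up with the file sizes above and is why the quantized version below fits into an 8GB card with room to spare for context.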

-> So, in order to use this model on a VRAM-constrained GPU, please download a quantized or "4-bit" version of it. I'd recommend "TheBloke/Kunoichi-7B-GPTQ" in the "gptq-4bit-32g-actorder_True" permutation or "branch". WTF is this branch, you ask? It's another fancy word indicating how exactly this model was quantized and wants to be loaded, and it has an impact on inference quality. The model files are usually the biggest files within a model card.

To quickly download this quantized model from HF, just copy-paste this line into the ooba download field: "TheBloke/Kunoichi-7B-GPTQ:gptq-4bit-32g-actorder_True". Once the download has finished, please repeat the loading process and report back here, Zuzus. You will see it now fits into your 8GB VRAM budget. ^-^
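
If you'd rather pull that same branch outside the web UI, the huggingface_hub Python package can do it too - a minimal sketch, assuming you have the package installed (the local folder name is just an example):

```python
# Download one specific quantization branch ("revision") of the GPTQ repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Kunoichi-7B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",        # the branch recommended above
    local_dir="models/Kunoichi-7B-GPTQ-4bit-32g",  # example target folder (assumption)
)
```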

Lastly, whenever you encounter an error message - please include at least the last line of output so that others can understand WHICH error you got.

I hope this helps!

Working very well, thank you very much for explaining all this - now I know what to look for :D

@xpgx1 I'm new to this too.. thank you, you just explained what I needed. This is the first model I've loaded using Exllama, and I'm surprised how fast it is!!!
I will pay more attention to models that support the Exllama loader.

Would you know how to tell if a model is prepared for Exllama when the developer does not specify it on the page? And what's the context size? Unlike a .gguf loaded with llama.cpp, it does not show in the cmd how much context it has been trained on.

@Noire1 - No problem. I know how confusing (or at least complex) this can be. And it IS an emergent tech field, so there is a huge wealth of info out there and not a lot of easily reachable documentation. It's entirely normal to feel a bit lost. Personally, I don't want people flocking only to cloud services - LLMs should be in our hands, directly. And I think you bring up another valid point, another fancy word and pitfall for starters that I forgot to mention above!

The FORMATS - this is not a perfect list, nor is it comprehensive, but it will give you a quick reference:
When you quantize the original fp16 files, you can choose which specific "format" or "container" your LLM should fit into. It's absolutely comparable to mp4, mkv, etc. - only in this case it also enables or disables certain use cases. So it's a good idea to know, roughly, what to expect.

GPTQ: My recommendation - as it's easy, fast, and "simple". This is a purely GPU-accelerated format. Any model in this format will be loadable by ExLlama and ExLlama2 (try the _HF variants first - they often tie in a bit better with SillyTavern, for example).

GGUF: The successor to GGML; it can load models into both RAM and VRAM, splitting them effectively. This is a nice way to experiment with larger models that would normally exceed your VRAM budget. Since LLMs are... "structured" in many different layers, like a lasagna (I'm not making this up! =), you can offload a certain number of layers into RAM. Your CPU will take those and TRY to keep up with the GPU-accelerated layers. It will cost time per token, but it will at least be doable - there's a small sketch of this after the list below. This is a bit more involved, depending on the software you use. I love Ooba, as it is very flexible. But ymmv.

EXL2: New format, different flavors, I'd skip it for now. If you guys need more info - just ask co-pilot. Really, he will point you in the right direction (reddit =P).

AWQ: Another new format, looks rather promising and will be relevant going forward. It's not necessary, in my mind, to expand on this here. Co-Pilot knows! =)

Others: Not that relevant, as these are probably BRAND new, difficult to implement for starters and don't provide you any benefits, currently (as of Jan24).
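
To make the GGUF layer-splitting mentioned above a bit more concrete, here's a minimal llama-cpp-python sketch - the file name and layer count are placeholders you'd adjust to your own VRAM budget:

```python
# Load a GGUF quant and push part of its layers onto the GPU;
# whatever doesn't fit stays in system RAM and runs on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/kunoichi-7b.Q4_K_M.gguf",  # example quant file (assumption)
    n_gpu_layers=24,  # layers to offload to the GPU; lower this if you run out of VRAM
    n_ctx=4096,       # context window to allocate
)

out = llm("### Instruction:\nSay hi in one short sentence.\n\n### Response:\n", max_tokens=32)
print(out["choices"][0]["text"])
```

In Ooba the equivalent knob shows up as the n-gpu-layers setting of the llama.cpp loader, if I remember right.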

=> So - to enjoy these models I'd suggest using the Huggingface search bar and typing in your model name - plus GPTQ (or the format you like to use). That is also precisely why "TheBloke" is essentially doing a REAL service - as he mercilessly converts many popular models from fp16 into all kinds of formats and provides all kinds of branches (and loader-specific variants).
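
If you'd rather do that search from a script than from the website, the hub API exposes the same thing - a small sketch with huggingface_hub, using this thread's model as the example query:

```python
# List quantized variants of a model on the Hub, e.g. GPTQ conversions of Kunoichi-7B.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(search="Kunoichi-7B GPTQ", limit=10):
    print(m.id)
```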

Phew, this was a lot. Couldn't condense it more =) I hope this helps!

@xpgx1 Thank you soo sooo much!!!
