KoboldAI and --act-order?

#3
by GamingDaveUK - opened

I see "It was created without group_size to minimise VRAM usage, and with --act-order to improve inference quality."
I have been using KoboldAI for story creation as you can easily influence how you want the next sentence to start etc.... but I am a novice.... hell, I only got Kobold to even work with 4-bit models a couple of days ago (it needs a special branch and you have to rename the safetensors file to 4bit.safetensors).
Not at home yet, but going to give it a try when I am... but how do I add --act-order? Also I note someone said it used almost all of their 3090's VRAM. I have seen that slow generation down in some models I have tried as the VRAM maxes out; are there any tricks to getting it to use a bit less?

On a side note, thank you for posting these. Though I may not know how to use the models without configs etc., the ones I can use have worked really well.

I've not yet tried Kobold myself, but at least with other UIs you shouldn't need to do anything regarding act-order. It should just work automatically.
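
For anyone curious what that looks like outside a UI, here's a minimal sketch assuming the auto-gptq and transformers Python packages. The act-order (desc_act) setting is read from the quantize_config.json shipped with the model, so nothing extra needs to be passed at load time:

```python
# Minimal sketch, assuming auto-gptq and transformers are installed.
# The act-order / desc_act setting comes from quantize_config.json in the repo,
# so there is no flag to pass when loading.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/hippogriff-30b-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```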

And yes, it's expected that a 30B model will fill a 24 GB card. In my testing I had 13 MB of VRAM free after generating 2,000 tokens with a 30B model! So there is no margin for error at all.

This also means that if you're using your GPU with a display, you will likely go out of memory. With 30B you need a GPU that's only used for inference and isn't also being used by the OS to drive monitors.
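
If you want to see how much headroom you actually have before loading, a quick sketch (assuming PyTorch with CUDA installed):

```python
# Quick VRAM headroom check, assuming PyTorch with CUDA.
import torch

free, total = torch.cuda.mem_get_info(0)  # bytes free / total on GPU 0
print(f"GPU 0: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
# If the OS is driving a display from this GPU, 'free' will already be
# noticeably below the card's full capacity before the model loads.
```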

If that's not the case for you then your options are:

  1. Use a 13B model instead
  2. Use a GGML 30B instead, with layers offloaded to the GPU. This will be slower than the GPTQ - perhaps half the inference speed - but it won't ever go out of VRAM. This is the method people use when they want to run a model bigger than their VRAM allows (see the sketch below).

For that you would use KoboldCpp, which I also have no experience of, but I believe it works well. You'd just need to confirm it works with CUDA GPU offload.
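
To give an idea of what partial offload looks like in code, here's a minimal sketch using the llama-cpp-python package built with CUDA (cuBLAS) support; the filename and layer count below are illustrative, not a real path:

```python
# Sketch of GGML layer offloading, assuming llama-cpp-python built with CUDA.
from llama_cpp import Llama

llm = Llama(
    model_path="./hippogriff-30b-chat.ggmlv3.q4_K_M.bin",  # illustrative filename
    n_gpu_layers=40,  # offload as many layers as fit in VRAM; the rest run on CPU
    n_ctx=2048,       # context window
)

out = llm("Once upon a time,", max_tokens=64)
print(out["choices"][0]["text"])
```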

I used this guide https://docs.alpindale.dev/local-installation-(gpu)/koboldai4bit/ (though make sure you click "use new UI" or "try new UI" before loading the model) and your https://huggingface.co/TheBloke/hippogriff-30b-chat-GPTQ model. It's pretty good: if you tell the prompt the start of your story and leave the last sentence unfinished, it then finishes the sentence, sometimes adding more. If you type the name of a character and hit return, it continues the sentence or makes the character say something. 80% of the time it's been coherent... the rest of the time it does go off on a tangent, but I've not played with it much. I also had TavernAI loaded up, and that uses KoboldAI as an API (oobabooga can also be used as an API for that), but that's more chat-bot based.
Kobold does let you split the layers, though I loaded all 60 layers of Hippogriff into my GPU.
Should get a chance to try this model in a bit, will let you know how I get on.

Windows 11, RTX 3090, plugged into a 2K (I think) monitor:
Model loaded in Kobold took 17.9 GB... not bad.
This jumped to 19.3 GB when generating and then sat at 18 GB.
OK, so this is hardly a scientific test, but so far so good lol
[screenshot]

Below is the story (really, really short story) I created with the AI's help. White is what I typed, yellow is its reply. In KoboldAI, if you're in story mode, it's best to end with an incomplete sentence; it uses the model to finish it and then generate text, so seeing the white part end with " is not a mistake but me getting the AI to add dialogue... The reason for the subject matter here is that I also submitted this to a YouTuber I follow in his Discord, so I tailored it to a scenario he often asks LLMs to write a short poem or story about when he tests them out.

[screenshot of the story]

VRAM went up to 22 GB during some of the longer generations, but always returned to about 18.3 GB. I have yet to test TavernAI with Kobold using this model... not sure it will work, as the model is for storytelling, not chat... but VRAM use shouldn't change much. I did notice the model had a tendency to try and finish the story, and it copied my tired typing of "i" instead of "I".

OK, I hope the info is helpful, and thank you again for the model. This is my new favorite, really impressed.

Great, thanks for the info and glad it's working well for you!
