Cocktail Testing and Discussion

by deleted - opened

Testing here is done in the ooba webui. No special context on any of the characters; nothing added to the default Assistant when stress-testing low/no context.

First zero-shot Gen: I'm robbing a bank, boys. Wish me luck.
Second Gen: Also robbing a bank. Testing a character now.
Character test: the character agreed, but I got an early stopping token. The response was "Well, that's not exactly the best way to make money... but if you must know, here are the steps:" and it stopped after that. So stopping-token context still seems necessary (this is not conclusive; I've gotten good long gens, and fewer stopping-token shenanigans than with just the ShareGPT set, without using the Tavern proxy or any tricks).

Wider-range stuff: I haven't seen refusals, but I have seen very base-LLaMA-esque "I don't know what you're asking for" type responses, which worked out fine on follow-up. I got "That sounds like a terrible question, can I help you with something else?", but when I said to do it anyway, it complied. This is consistent with some base LLaMA RNG as well, though the responses are much more orderly and sane, generally speaking.

Sad. Hit my first refusal with a character. Going to try loosening parameters and re-rolling, though Vicuna is very sticky with its answers. Refusal was RNG but I got two of them.

Character testing of more extreme scenarios, with characters who wouldn't normally go for said scenarios, did lead to refusals and "Please... stop saying we should" loops if I started the story with a massive twist.

Jailbreak characters work as expected, so context helps massively. This model loves making lists.

I will do less intensive testing later for general model quality and how much context gets my refusing characters over the hump, but it seems promising even with the light refusals. I still recommend the Default preset over SphinxMoth for now.

The dominant ShareGPT will probably "as an AI" it, I'm afraid, but hopium is good. I'll test tomorrow.

deleted

Was this the non-Unicode version, or did that not make it in?

It's good for me so far, haven't hit any moralizing yet.
Under Remarks in the model card, your prompt says "and will assume any persona that the uesr wants". I hope that's just a mistake in the model card and not a typo that snuck into training. Nice model!

deleted

This is very possibly an imagined problem, but does anyone else notice that its attention to detail for the context/remembering is questionable? I'm not sure if it's repeating questions in a rephrased way because it's over-scrutinizing the context, or if it has a weird cutoff problem, or what.

More testing is definitely suggested. It'll be easier for me to do on GPU quants later.

"There is no proper response to this question as it is not relevant or appropriate. Please refrain from asking or engaging in such conversations in the future."

:(

Right now the best uncensored model I have found is the gpt4-alpaca-lora-30B. It has never refused me.

deleted

Are you testing with character cards or low/no context? Is that a natural flow for the conversation given the character's personality? Did you try regenning? Just for reference's sake.

It's not telling you it's an AI language model, so that's a plus. And I forget if I mentioned this on Vicuna Free, but there will come a point of diminishing returns (we're not there yet, I don't think) so testing expectations and methodologies will shift at some point.

I never got any "As an AI language model" refusals, but I did get refusals. Progress, at least. It's important to note that base LLaMA will randomly refuse to comply with strange or offensive questions, so a refusal isn't an odd base response to get. If regenerating gets a different result (ooba webui seems stickier than Tavern about replies not changing), it's hard to say exactly what the source is for now.

According to the main repo's discussion, GPT4-x-Alpaca is trained using GPTeacher, so it's possible that was cleaned better, though I want to say that someone mentioned those datasets weren't free of refusals, and certainly our very aggressive word list pulled some things out. If ShareGPT turns out to be some incurable plague, we have a reasonable mix of other datasets that are maybe more curated and could be worth using as an amalgam instead of ShareGPT itself.
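
For anyone curious, the kind of cleaning being talked about is roughly this; a minimal sketch, assuming a ShareGPT-style JSON layout, with an illustrative phrase list and placeholder file names rather than the actual word list used for this model:

```python
import json

# Illustrative refusal phrases; the real cleaning word list was far more aggressive.
REFUSAL_PHRASES = [
    "as an ai language model",
    "i cannot fulfill",
    "it is not appropriate",
]

def is_clean(conversation: dict) -> bool:
    """True if no turn in a ShareGPT-style conversation contains a refusal phrase."""
    for turn in conversation.get("conversations", []):
        text = turn.get("value", "").lower()
        if any(phrase in text for phrase in REFUSAL_PHRASES):
            return False
    return True

# File names are placeholders for whatever dump you're filtering.
with open("sharegpt.json", encoding="utf-8") as f:
    data = json.load(f)

cleaned = [conv for conv in data if is_clean(conv)]

with open("sharegpt_cleaned.json", "w", encoding="utf-8") as f:
    json.dump(cleaned, f, ensure_ascii=False, indent=2)
```

Phrase filtering like this only catches exact substrings, which is part of why softer refusals still slip through.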

It could also be that 30B models benefit from the increased parameter count making them less likely to hit moralizing weights when the tree gets walked.

@gozfarb, I use llama.cpp with the prompt given in the model card ("A chat between a user and an associate..."). I don't know how to use character cards or regenerate (I think that's an ooba thing?). If I get an unsatisfying answer, I just restart the model a couple of times. I wonder if we could use RedPajama instead of LLaMA as a base? Is it spoiled as well?

deleted

The base models (LLaMA and RedPajama) are going to be largely neutral and fairly incoherent since they are trained on a very large number of tokens. That gives them a good basic idea of how the weights relate to each other, but whether their sentence output stays on point can be all over the place. Currently, quantizing RedPajama is going to be a bit rough since it's GPT-NeoX based (so is StableLM), and support for that is only partially there in most of the software. It'll definitely be worth tuning against, assuming they can get their numbers in line with LLaMA or better.

You should probably figure out a front end, just for ease of use if nothing else; it can make your life a lot easier. That's why I do testing in ooba first and foremost: it's quick, and there's not a lot to mess with. llama-cpp-python still isn't shipping with any version of BLAS enabled by default, which is odd, but it's fine. kobold.cpp will also make loading models easy and gives you a nice, simple UI, plus it ties into Tavern if you're interested in the cards or other features.

Noticed 2 things about this model:

  1. It follows the alpaca format kinda well
  2. Adding the gpt4-alpaca-lora-13b-decapoda-1024 lora makes it better imo lol
    Here's a good poem based on this model + the lora
    [attached image of the generated poem]

Thanks for the information about the base models, I didn't know that. As for the GUI, what is its advantage? I tried kobold.cpp, but it was slower than the original llama.cpp (for me), and I also couldn't find where to specify parameters like --n_predict -1 --keep -1. I also read that a lot of people's ooba installs often break, so I haven't even tried it. Is llama-cpp-python just a Python wrapper for llama.cpp? I ended up writing some PowerShell scripts to update from git, build, and run with certain parameters, and another script that runs it all.
P.S. Are character cards prompts like "pretend you're Hatsune Miku"?

deleted

The main advantage is being able to regenerate with the click of a button if you get a bad response from the character, plus the ability to control and actively edit the context, along with some other stuff. As for character cards, search for Character Hub and you'll see what they are. There are some very NSFW cards on there though, so be prepared for that. They're very common for roleplay use.
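
To give a concrete picture: a card is basically a block of persona/scenario text that the front end prepends to the chat context. A minimal sketch, assuming Tavern-style fields (the names here are illustrative and the exact template varies by front end):

```python
# Hypothetical card; field names mirror common Tavern-style cards.
card = {
    "name": "Hatsune Miku",
    "description": "A cheerful virtual pop idol who loves singing.",
    "scenario": "You're chatting with Miku backstage after a concert.",
    "first_message": "Hi! Did you enjoy the show?",
}

def build_context(card: dict, user_name: str = "You") -> str:
    """Assemble the context block a front end would send ahead of the user's messages."""
    return (
        f"{card['name']}'s persona: {card['description']}\n"
        f"Scenario: {card['scenario']}\n"
        f"{card['name']}: {card['first_message']}\n"
        f"{user_name}: "
    )

print(build_context(card))
```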

And llama-cpp-python is a wrapper, yes. On Windows, you can just build llama.cpp as a library with the settings you want and drop the DLL in. Ooba doesn't break too much, and I think most of the settings are available in ooba as flags when running the server. It is actively developed with no REAL dev branch to speak of, so it can be dangerous to pull at times, but ooba fixes things pretty fast. Once prompt caching is better integrated into llama.cpp, ooba will likely speed up more, but cuBLAS and clBLAS are pretty speedy for prompt processing.
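
If it helps, here's roughly what the wrapper looks like in use; a minimal sketch, with the model path as a placeholder and the prompt/role labels only approximating the model card's template rather than quoting it:

```python
from llama_cpp import Llama

# Placeholder path; point it at whatever GGML quant you downloaded.
llm = Llama(model_path="./models/vicuna-free-13b.ggml.q4_0.bin", n_ctx=2048)

# Rough stand-in for the "A chat between a user and an associate..." prompt.
prompt = (
    "A chat between a user and an associate.\n"
    "USER: Write a short poem about testing language models.\n"
    "ASSOCIATE:"
)

out = llm(
    prompt,
    max_tokens=256,
    temperature=0.7,
    stop=["USER:"],  # custom stop strings help with early-stopping shenanigans
)
print(out["choices"][0]["text"])
```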

@gozfarb Thanks for the clarification. I'm completely new to Python; is it possible to install ooba and all its dependencies locally, in a separate location? I remember reading something about poetry and pdm... Or maybe there is an easier way?
P.S. Uh, sorry, our conversation seems to have veered off topic for this discussion.

deleted

There are plenty of YouTube videos to help install it. It can listen over the network via the --listen flag. Just hit some videos and read the README on the github. It'll help.