ExLlamaV2 quant (3.9 bits per weight, 6-bit head) of Mirai FP16

Previous Rendition:

This is the 43rd evolution. I like this model so much I'm calling it a 1.0 release. It is by far my favorite of all the evolutions I tested. The prose is excellent, and the instruction following and especially the huge amount of variety and intrigue are really something.

Some random advice:

Try entirely disabling repetition filtering/penalties. I found the model improves substantially in other areas, and the repetition seems quite low (for what I'm testing). This model can withstand very high temperature, though of course it has the typical downsides; I'd play with values between 0.3 and 2.5. It's stable-ish even into the 2.X range, though I'd only recommend that for really wild experiences more akin to an AI Dungeon run. Finally, I use a min-p of 0.05. Go too low and there are obvious linguistic issues, like other languages such as Korean popping up. I would not go much higher than 0.05 either.

Remove DISLab/SummLlama3-70B

This model is certainly responsible for short responses, and I removed it because they were too short on average for my tastes. The writing style also seemed to converge a bit, in a way that I found boring.

Add WhiteRabbitNeo/Llama-3.1-WhiteRabbitNeo-2-70B

Low-impact model; it seemed to contribute very slightly to a better-behaved and more interesting conversation. Strangely, this model seemed to primarily improve the instruct capabilities more than anything else, and I did not find any real evidence of downsides.

Add TsinghuaC3I/Llama-3-70B-UltraMedical

I tried both this and aaditya/Llama3-OpenBioLLM-70B and found that this was the better performer; in fact, OpenBioLLM seemed to have a negative impact more than anything. The contribution I noticed most strongly was to grammar and the general wordiness of the model. It also pretty clearly added some new info domains when asking raw knowledge questions. Likely a good addition.

Add allenai/Llama-3.1-Tulu-3-70B

Tulu is back. I changed how I do personality prompting, and with all the other changes, I'd say this model is a fine addition now. It's quite hard for me to get a real measure of the changes, but it does seem to have improved instruct and likely contributed a bit to a sunnier overall personality. Unfortunately, I'm certain this added a moralization element. It's not overly strong, but you can certainly pick up on it at times.

Add hitachi-nlp/Llama-3.1-70B-FLDx2

Talk about unexpected. This might be the single best model in the entire merge. This model did so many things it's hard to count, and the impact is extreme in many areas. The lowest impact was to personality, where I did not notice a shift at all. The logic, reasoning, and recall of the model improved dramatically, although it's somewhat more selective in its recall. The most substantial thing is that repetition is greatly reduced, to such a degree that I've entirely turned off all repetition penalties and did not notice an issue. I think this model may have the most developed anti-copy heads of any model in the merge. This model was a total shock to me but a very welcome addition.

Model Architecture

This is a Model Stock merge. Thanks to mergekit for making this really easy to do.

Base:

huihui-ai/Llama-3.1-Nemotron-70B-Instruct-HF-abliterated

Stock:

Sao10K/L3.1-70B-Euryale-v2.2
rinna/llama-3-youko-70B
yentinglin/Llama-3-Taiwan-70B-Instruct
meta-llama/Meta-Llama-3-70B
PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B
tokyotech-llm/Llama-3.1-Swallow-70B-v0.1
Bllossom/llama-3-Korean-Bllossom-70B
(Custom) MergedHistLlama-70B
WhiteRabbitNeo/Llama-3.1-WhiteRabbitNeo-2-70B
TsinghuaC3I/Llama-3-70B-UltraMedical
allenai/Llama-3.1-Tulu-3-70B
hitachi-nlp/Llama-3.1-70B-FLDx2
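For readers unfamiliar with Model Stock, here is a rough sketch of the idea on a single flattened tensor: average the fine-tuned weights, then pull that average back toward the base by a ratio that depends on how well the models' deltas from base agree. This follows the interpolation formula from the Model Stock paper as commonly stated, t = N·cos θ / (1 + (N−1)·cos θ). The function names and the choice to estimate cos θ as the mean pairwise cosine are assumptions made for illustration. This is not mergekit's actual code and not the exact recipe used for this model.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumes non-zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def model_stock_merge(base, finetuned):
    """Merge one flattened tensor (hypothetical helper, illustration only).

    base:      list of floats, the base model's weights
    finetuned: list of at least two weight lists, one per model in the stock
    Returns (merged_weights, t), where t is the interpolation ratio.
    """
    n = len(finetuned)
    # Each model's offset from the base weights.
    deltas = [[w - b for w, b in zip(ft, base)] for ft in finetuned]
    # Assumption: estimate cos(theta) as the mean pairwise cosine of the deltas.
    pairs = list(combinations(deltas, 2))
    cos_theta = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    # Interpolation ratio: 1 keeps the plain average, 0 falls back to the base.
    t = n * cos_theta / (1 + (n - 1) * cos_theta)
    average = [sum(column) / n for column in zip(*finetuned)]
    merged = [t * a + (1 - t) * b for a, b in zip(average, base)]
    return merged, t
```

When the deltas agree (cosine near 1) the merge trusts the average; when they point in unrelated directions (cosine near 0) it stays close to the base. That is one way to see why models that barely differ from each other add little to a merge like this.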

The rest of this goes over my rationale for the approach, my goal, and the observations I've made, in the form of a totally unstructured series of rants.

Selection

The model selection was done using an evolutionary approach: each generation added or removed a model, and I tested the outcomes by hand, without any automation or benchmarks.

Why are these models candidates for selection?

Long story short, there are few choices for large trained models that are very different from each other. Almost all RP models use similar datasets, and often train exclusively on a small distribution of specific data. This often means they differ numerically only very little from their root model, which makes merging less impactful. For this reason, many common roleplay models have been avoided as candidates, and I've only selected a single one, Euryale 2.2 from Sao. I did try two others, and the merge was significantly worse. That leaves the actual selection: I started from the most varied models available in the limited pool and let the evolutionary process decide the rest.
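One way to put a number on "differs only very little from its root model" is the relative size of the weight delta. The sketch below is purely illustrative: the source does not say how (or whether) this was measured, and `relative_delta` is a hypothetical helper operating on plain lists rather than real checkpoints.

```python
import math


def relative_delta(root, tuned):
    """Size of (tuned - root) relative to the size of root, for one flattened tensor.

    Values near 0 mean the fine-tune barely moved from its root model.
    Assumes root is not all zeros.
    """
    diff = math.sqrt(sum((t - r) ** 2 for r, t in zip(root, tuned)))
    norm = math.sqrt(sum(r * r for r in root))
    return diff / norm
```

A fine-tune with a tiny relative delta contributes almost the same weights as its root, so adding it to the stock changes the merge very little.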

Base Models VS Instruct Tunes

A very important point that is often missed is that instruct tuning is often where the moralizing, dehumanization, and overcompliance come in. This is not that surprising: instruct is designed to be compliant, and the nature of instruct tunes leads naturally to strong biases in the model's vocabulary and personality. So the goal was to use the base models in the merge, like youko and the original Llama 3 70B base, as regularization to pull back from some of these learned habits of the instruct models, which are not present (or perhaps I should say, not as prominent) in the base models.

Is Merging even Effective?

I am not a merge believer, but I have to concede the objective fact that model stock works very well. It's the only merging method I've had success with, and my experiments with it continue to impress me.

Why Nemotron Base for RP?

I'll discuss this later in more detail, but roleplay is more than raw knowledge. Having a diverse range, a strong vocabulary, and good understanding is important to making the world feel real, where actions have consequences that are understood by the model. I found Nemotron to have the best capacity to understand (zero-shot) implications and hidden meanings. It's even quite good at explaining them if you ask. However, the model has many damage points: it tends to generate in increasingly incoherent ways as time goes on (in terms of rolling over the context window), which is fairly common but quite bad and very obvious in Nemotron's case. With correct sampling parameters, this model performs okay on its own. Merging, however, solves the severe degradation problem and improves its prose while retaining its "smart" qualities. I found this to consistently be the best base model for the merge.

Sampling

My personal advice: use less sampling with big models. I use a very low repetition penalty or none at all, from 1 to 1.03. I've liked temperatures between 0.9 and 1.5. Adding min-p 0.1 might be helpful, although I personally like the rare distributions to come up sometimes and lead to wild things. An alternative method is to use 3.0 temp with 0.5 min-p and apply temperature last, which gives you near-greedy sampling except when the model is uncertain. You'll usually need a higher rep-pen with this setup, as the model starts to get repetitious if you aren't careful. Finally, almost all sampling setups and samplers just make the model worse. Temperature adds randomness, which in roleplay is more interesting, but it also artificially introduces errors, which then cascade. A simple example: if the character has a red shirt and you ask the color of their shirt, bad sampling might force the model to spit out something random, like a maroon shirt. Basically, my recommendation is to use the least amount of sampling (as close to greedy as possible) that still gives interesting experiences.
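To make the "min-p first, temperature last" setup concrete, here is a minimal sketch over a toy logit vector. It assumes the usual definition of min-p (keep tokens whose probability is at least `min_p` times the top token's probability) and is not taken from any particular backend. The function names are made up for illustration, and repetition penalty is left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_then_temperature(logits, min_p=0.5, temperature=3.0):
    """Return per-token sampling probabilities with temperature applied last.

    1. Filter with min-p on the unscaled distribution.
    2. Apply temperature only to the tokens that survived the filter.
    """
    probs = softmax(logits)
    threshold = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    scaled = softmax([logits[i] / temperature for i in kept])
    out = [0.0] * len(logits)
    for i, p in zip(kept, scaled):
        out[i] = p
    return out


def sample(probs, rng=random):
    """Draw one token index from the filtered distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a confident distribution such as `[10.0, 0.0, 0.0]`, only the top token survives the filter, so the high temperature has nothing to act on and the result is effectively greedy. With a near-tie such as `[2.0, 2.0, -5.0]`, both leading tokens survive and the temperature spreads probability across them.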

The Goal

The overall goal of this model is a more human experience. A good roleplay model involves much more nuance than is often discussed, so I'll go over the main points here.

Knowledge

Knowing things is what LLMs are good at, really good at. The issue is that they know things in a probability space, and under certain conditions the context and logits don't let them find the knowledge encoded in the weights. Paradoxically, this means LLMs can know and simultaneously not know things, or know things in multiple different ways. The reality is that people care about the perception of consistency. As long as the model appears to consistently represent knowledge, I call this a win. One of my foremost goals is that the model represents both in-distribution and in-context things in a fairly consistent way.

"Unspoken Understandings"

People sometimes say things using indirect language and rely on the listener to pick up the meaning; these are implications. They can also happen in terms of actions. Consequences share a similar idea. Language models really struggle with both of these categories. I would say this model has a moderate ability to grasp them, but it has not met my standard of consistency. I imagine large book datasets and stories would be the most representative of these ideas, but consequences are often long-term dependencies in stories, which are even harder to represent in the short context windows we currently have. Likely, this will not massively improve without longer context windows and better comprehension over them.

Homogenous Representation, Positions, Locations, Distance

Surprising nobody, purely text-trained language models with no sensory organs or actual world experience don't have a great understanding of physics in a literal sense. Where things are and how they move are things they learn exclusively through text. Models have picked up on some of this, but in a clearly pattern-based way: even physically simple actions often confuse their ability to locate objects, keep positions sensible, and represent object proportions homogenously. These are very immersion-breaking problems: if I swing a sword through the villain, I expect them to be cut. As you can see, this combines two problems, an understanding of physical actions and an understanding of their consequences. Luckily, for most tasks these models have picked up some ability to fake their way through, but I have made a conscious effort to select for this more than for other features.

Personality, Characterization, Separability

For a basic two-character roleplay, you expect the one character to maintain a somewhat coherent persona. This is very difficult to shape with language models while keeping all other faculties alive. This can result in dumber characters simply because they're acting in a role. A separate issue is that introducing more characters tends to blend their personalities together over time. This model is surprisingly good at keeping a few distinct characters, but they must have started out distinct. Two somewhat similar characters will almost always blend together after a while. This is almost certainly fixable with custom types of tunes, but it is not at all a trivial or easy fix.

The World

World consistency is very, very difficult. It is unsolvable in a true sense because of limited context windows and models' lack of any other memory abilities, so I'll keep the scope to the world in-context. I've found that, among all things, the world tends to be quite accurate in-context provided there's a limited degree of change. This is sort of the case for most things: if a dragon destroyed the castle and you had 6K context of in-castle RP, you might find the castle is suddenly back somehow, undestroyed. This again goes back to consequences and consistency. Overall, however, I think the world consistency is actually quite good in-context, and I've found that even most small models can do this accurately.

There's a whole lot left unsaid here, but these are the main ideas. When testing models, these are the things I have in the back of my mind: how is the consistency, the world, the personalities, etc.? Other things I look for are general prose and the issues I outlined above with common LLM-isms. Selection was made to get the best of these qualities while still feeling "human-ish" and interesting.

Why not Fine-Tune instead?

Cost. I've fine-tuned over 200 small 7B and 13B models on various datasets, and I've had very good results. However, data selection is very tedious, and you often need many attempts to get what you're looking for. I would in fact love to do continued pre-training on the base model with my custom datasets, but the reality is that this is too expensive for me right now.

What else can be done?

Anthropic, the Claude people, published "Scaling Monosemanticity" a while back. (Similar to abliteration: https://huggingface.co/blog/mlabonne/abliteration) It allows the manipulation of features without classic fine-tunes, via interventions. I've been conducting several experiments to see how these can be used not only to shape the model but also to dynamically change it with "feature sliders." Again, though, playing with large models is very expensive.
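As a rough picture of what a "feature slider" could look like, the sketch below rescales the component of a hidden-state vector that lies along a single feature direction. It is a toy illustration only, not Anthropic's method and not the experiments described above. It assumes a feature direction has already been found somehow, and `scale_feature` is a hypothetical name.

```python
import math


def scale_feature(hidden, direction, slider):
    """Rescale the part of `hidden` that lies along `direction`.

    slider = 1.0 leaves the vector unchanged,
    slider = 0.0 removes the feature entirely (abliteration-style projection),
    slider > 1.0 amplifies it.
    Assumes `direction` is not the zero vector.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # How strongly the feature is currently expressed in this hidden state.
    strength = sum(h * u for h, u in zip(hidden, unit))
    return [h + (slider - 1.0) * strength * u for h, u in zip(hidden, unit)]
```

For example, `scale_feature([3.0, 4.0], [2.0, 0.0], 0.0)` zeroes the first coordinate and leaves the second untouched. In a real model this kind of edit would be applied to activations at inference time, which is where the cost of working with large models comes in.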

Blackroot/Mirai-70B-1.0-3.9B-6H