Sampling Resources and Conjecture

#2
by Clevyby - opened

Hello, this thread is for talk about sampling. Anyway, I've been working on an explanation of my rationale for the sampling preset I use, which is mostly smoothing and min_p, and which I'll be using for upcoming reviews. I'd like some thoughts on it and suggestions if you have any. Rationale for using the sampling preset: https://files.catbox.moe/xq0y03.txt

LWDCLS Research org

Maybe put the entire thing here directly, I think it'll be easier for people to see.

Sure, here's the whole thing. The reason I used catbox was so I wouldn't bloat my posts too much in any of my upcoming reviews.
Edit: Don't mind the spacing or formatting errors in the post below.

Warning: I am an amateur on this kind of subject; I'm not very technical, but I know enough to RP decently. Also, images are linked as URLs.

Alright, so you may be wondering: why do I use only smoothing and min_p for testing? To start off, I use smoothing for its dynamic characteristic. Dynamic in the sense that it raises and balances the top token probabilities while incrementally lowering the less likely ones, depending on the smoothing value. The difference between smoothing and temp is that smoothing covers and considers 'all' tokens, while temp either focuses mostly on the top tokens (at low to mid values) or boosts the majority of the tokens that aren't top probability (at high values). In short, with temp you go either very deterministic or very wild, while with smoothing you can dial in any degree of determinism and/or creativity by adjusting the token probabilities. You can use smoothing curve to tinker with smoothing further, but as of now there's no option to visualize the smoothing curve. Temp is stiff; smoothing is flexible. To visualize this more easily, check out the example images below, which I made using artefact's LLM token-probability visualization tool. Do note that the visualizations below are based on open-ended prompts; if I had visualized token probabilities on question prompts, it would be a matter of factuality, not creativity.

Temp: 3, Min_p: 0.135:

https://cdn-uploads.huggingface.co/production/uploads/6580400298aa9fcdd244c071/O0qoPmaMfM3UXjgObqP2V.jpeg

Smoothing: 0.07, Min_p: 0.075:

https://cdn-uploads.huggingface.co/production/uploads/6580400298aa9fcdd244c071/HWEy-cOaQV9jC1B7W_2rh.jpeg

As you can see, smoothing produces a curve-like distribution from the most considered top token down to the least considered token, whereas with temp almost all low-probability tokens sit at nearly the same level, with little difference between them, compared to the vast differences among the top-probability tokens. So, with the flexibility of smoothing, I use it to increase token diversity by raising the probability of the tokens that sit between the top-probability and low-probability ones. Why? Because a balanced diversity of tokens means more creativity while staying coherent to an extent. Focusing only on top tokens, which I've noticed is a trend in many recommended sampling parameters, is somewhat limiting if you want to grasp the full capabilities of an RP model. I just feel some tokens are underutilized and lumped in with the rest of the low-probability tokens, which doesn't fare well for creativity. To achieve this diversity, I use a low smoothing value, because smoothing values are quite sensitive: at 0.1 or above, token probabilities become drastically more deterministic, and the opposite happens below 0.05.

Now, the min_p part is for quality control. With low smoothing values, top-token probabilities are lowered and low-probability tokens are raised, so I need something to enhance coherence and cut off nonsensical tokens. That said, I use min_p as minimally as possible so I can retain many tokens for diversification. The relatively high min_p values are there to keep up with smoothing's high-temp-like side effects; min_p needs to stay roughly proportionate to the effective temperature. So if I want more creativity, I use a smoothing value of 0.06-0.07; for more determinism, 0.08-0.09. Overall, I combine the dynamic creativity of smoothing with min_p as a coherence enhancer, which is good enough to test the full capabilities of any RP model.
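
For reference, min_p itself is simple enough to sketch directly. Toy numbers, my own illustration:

```python
import numpy as np

def min_p_filter(probs, min_p):
    # min_p keeps any token whose probability is at least
    # (min_p * probability of the single most likely token);
    # everything below that moving threshold is discarded.
    threshold = min_p * probs.max()
    kept = probs * (probs >= threshold)
    return kept / kept.sum()  # renormalize the survivors

probs = np.array([0.40, 0.20, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])
print(min_p_filter(probs, 0.075))  # mild trim: only the 0.02 straggler is cut
print(min_p_filter(probs, 0.30))   # harsh trim: only the three top tokens survive
```

The cutoff moves with the top token, which is why a fixed min_p value behaves differently depending on how much smoothing (or temp) has flattened the distribution.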

Sampling parameters I considered but that did not make the cut for testing:

Top_p, Typical_P, Top_K - They specialize in cutting off tokens so that only the top tokens are considered. If the token distribution were a bonsai plant, min_p would be shears while these are axes, unwieldy to use on a small plant (a quick sketch follows this list).

Tail Free Sampling - I've heard anecdotally that this is somewhat similar to min_p, and judging by artefact's LLM sampling tool that might be true, though min_p is preferable as it's simpler to understand and more objectively measurable than Tail Free Sampling.

Top_A - Min_P is more exact in coverage than this.

Repetition penalty parameters - I prefer sampling parameters that are universal in usage. To use these effectively I'd have to be very, very specific per model size (7b, 13b, 20b), and repetition penalties are like bombs, hitting every token in their range, both the good and the bad. Besides, I can just increase creativity by adjusting either smoothing or min_p to offset repetition. Kalomaze has also argued they're unreliable for most models.

Mirostat - Argued by kalomaze in a reddit post to be somewhat unreliable. There was an anecdotal claim that it's basically just Top_K = 1000000. Also, I don't read scientific math.

Temp, Dynamic Temp - Good ol reliable, but stiff compared to smoothing.

No repeat Ngram Size - Uh... apparently used to curb repetitive phrases, but same reasons as the repetition penalties.

Beam Search parameters - Dunno what it is, too technical.

Contrastive Search - Requires sampling in general to be disabled so no.
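
To put the bonsai analogy in concrete terms, here's a toy comparison (my own illustration with made-up numbers) of how a cumulative cutoff like Top_p lops off whole branches while min_p only snips the stragglers:

```python
import numpy as np

# Toy sorted next-token probabilities, purely illustrative.
probs = np.array([0.40, 0.20, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])

def top_p_filter(probs, p):
    # Nucleus sampling: keep the smallest set of top tokens whose
    # cumulative probability reaches p, drop everything after that.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

def min_p_filter(probs, min_p):
    # min_p: drop only the tokens far below the current top token.
    kept = probs * (probs >= min_p * probs.max())
    return kept / kept.sum()

# The "axe": 0.40 + 0.20 + 0.12 = 0.72 >= 0.7, so only three tokens survive.
print(np.round(top_p_filter(probs, 0.7), 3))
# The "shears": cutoff is 0.09 * 0.40 = 0.036, only the two smallest tokens go.
print(np.round(min_p_filter(probs, 0.09), 3))
```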

Supplementary Links:

https://artefact2.github.io/llm-sampling/index.xhtml

https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/

https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e

https://gist.github.com/kalomaze/4d74e81c3d19ce45f73fa92df8c9b979

https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/comment/k9c1u2h/

LWDCLS Research org

For what you intended https://rentry.co/ is a better option.

I see, thnx for the rec. It'll take time to set up, though; I'll set it up in the future.

LWDCLS Research org

At least there it's just Markdown, so you can make it look pretty, but it'll be the same as here tbh.

LWDCLS Research org

@WesPro Sampling info den.

Hello, here are some sampling presets I came up with after some time experimenting with artefact's sampling tool. There are two presets, one that encourages creativity and one that encourages determinism. The purpose of these is to discover certain tendencies or quirks of a given model when faced with varying levels of creativity. I'd like some feedback if you feel like testing them:

Creative Ouroboros:

Temp: 1.02, smoothing: 0.06 min_p: 0.11

Deterministic Path:

Temp: 1.01, smoothing: 0.085 min_p: 0.04

LWDCLS Research org

Next time I'm doing actual testing I'll try to check them, but I'm still more of a feels-and-formatting-structure test person.

I didn't particularly have formatting in mind since I use ST's auto-formatting, and most models I've used from 7b upwards get it wrong occasionally at some point anyway; it's also influenced by the setup in the character cards.

If you encounter certain 'undesirables' while using the presets (like the model speaking for you, advanced formatting/author's note leakage into responses, or weird symbols) and want to see what responses could look like without them, you can try using the logit bias feature to ban tokens ('token' -100). For example, I was using Iambe V3 20b and the model kept steering into talking for me. To remedy that, I used logit bias to ban the tokens of my name (Clev and its 'derivatives', since the model wanted to cheese me despite my intentions) and the singular pronoun 'I' (since the model kept trying to switch to first-person POV as the user). Afterwards, responses were good.
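
For reference, this is roughly what a -100 logit bias does under the hood. This is just my own toy illustration, not ST's or any backend's actual code, and the token ids/vocabulary here are made up:

```python
import numpy as np

def apply_logit_bias(logits, bias_map):
    # Add a per-token bias to the raw logits before any sampling happens.
    # A large negative bias like -100 effectively bans a token, since its
    # probability after softmax becomes vanishingly small.
    biased = logits.copy()
    for token_id, bias in bias_map.items():
        biased[token_id] += bias
    return biased

# Hypothetical 5-token vocabulary, purely for illustration:
# 0: "Clev", 1: " I", 2: "She", 3: "The", 4: '"'
logits = np.array([5.0, 4.5, 3.5, 3.0, 2.0])
banned = {0: -100.0, 1: -100.0}  # ban the user's name and the pronoun " I"

probs = np.exp(apply_logit_bias(logits, banned))
print(np.round(probs / probs.sum(), 4))  # banned tokens land at ~0.0
```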

LWDCLS Research org
β€’
edited Apr 21

For formatting, I mean in the final responses; some cards I like have...

Stats: owo
Like: owo
This: owo

...that change as things progress, and the Llama-3s haven't been doing very well with that.

I actually have a 7B in mind that I use for comparison, and it's amazing with the final response formatting.

Nyanade is my point of comparison; even if the Llama-3s can be better in certain aspects - they keep track of characters and the scene in general a lot better - I get too bothered by misplaced quotations and asterisks or messed-up RPG-style stats. I need the model to jive well within ST first and foremost.

Here's a slightly less creative version of Ouroboros:

Forked Road:

Temp: 1.03, Smoothing: 0.07, min_p: 0.09

Update on presets:

  1. Creative Ouroboros V2 (lessened low-prob tokens a bit and increased creativity):

Smoothing: 0.05, min_p: 0.145

  2. Deterministic Path V2 (increased min_p harshly; the purpose is to evaluate the model's own choices of what it thinks is 'best'):

Smoothing: 0.085, min_p: 0.085

LWDCLS Research org
β€’
edited Apr 25

Considering Min-P uses the top token's probability to set the minimum % a token needs to be considered valid for selection, isn't the more deterministic approach to bring the value closer to 1 (100%)?

For example, Min-P 0.15 is a lot more deterministic than Min-P 0.075.

Screenshot_2024-04-25-06-43-41-867_org.cromite.cromite.png

Screenshot_2024-04-25-06-44-32-001_org.cromite.cromite.png

Or are you meaning to compensate that with Smoothing?

Lewdiculous changed discussion title from Sampling to Sampling Resources and Conjecture

Considering Min-P uses the top token's probability to set the minimum % a token needs to be considered valid for selection, isn't the more deterministic approach to bring the value closer to 1?

For example, Min-P 0.15 is a lot more deterministic than Min-P 0.075.

I mentioned before that effective usage of min_p is proportionate to temp, and since smoothing is a form of standardized temp, that applies here too. You can check the basis for that with artefact's LLM sampling tool: increase the temp there and every token's probability goes up, and you can see the result by enabling the min_p visualization. If you input a temp of 3 and set min_p to 0.05, the shaded line that min_p draws doesn't even reach any token. Or if it does, it only reaches the very far end of the token probability distribution, where it practically doesn't matter because the model may never consider that tail for selection anyway.

Ex, Temp 3, min_p 0.05:

Screenshot_2024-04-25_175406.jpg

So I suppose that for min_p to act deterministically, it first has to reach a minimum value that catches up to the tokens heightened by temp, and then be increased further so it actually covers a good chunk of the tokens.
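
A quick numeric sanity check of that point, with toy logits (my own illustration, not pulled from any real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([8.0, 6.5, 6.0, 4.0, 2.0, 0.5])  # toy values

for temp, min_p in [(1.0, 0.05), (3.0, 0.05)]:
    probs = softmax(logits / temp)
    threshold = min_p * probs.max()   # min_p cutoff scales with the top prob
    survivors = int((probs >= threshold).sum())
    # At temp 3 the top prob drops so far that the cutoff falls below
    # even the least likely token here, so min_p stops trimming anything.
    print(f"temp={temp}: top prob={probs.max():.3f}, "
          f"cutoff={threshold:.4f}, tokens kept={survivors}")
```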

LWDCLS Research org
β€’
edited Apr 25

Maybe the % value of the top token wasn't high enough in this example, since that's what sets the baseline % for Min-P? Since Smoothing clamps probs together, lessening the effect of Min-P, in a way you'd think you'd need a lot more Min-P when combining it with Smoothing?

Maybe the % value of the top token wasn't high enough in this example, since that's what sets the baseline % for Min-P? Since Smoothing clamps probs together, lessening the effect of Min-P, in a way you'd think you'd need a lot more Min-P when combining it with Smoothing?

The lower the smoothing value, the more 'temp' it bestows on the token probability distribution. And the higher that 'temp' factor is, the higher min_p needs to be to catch up to the tokens.

Also, I noticed an... interesting discovery while testing the presets. I was testing them when I noticed the gens kept reusing certain words. I got tired of that and tried lowering smoothing to 0.01, and for some reason the gens stayed coherent, which theoretically shouldn't be possible. So I checked the token probability viewer in ST and saw that the model was choosing the top tokens most of the time, as if the smoothing value I chose were negligibly ineffective. Then I realized that the 'temperature last' option might be what was pushing the model toward deterministic behavior. I unchecked it, and the model finally started considering lower-probability tokens at a much higher rate. With that, I have to say the previous presets I made are no longer good to use. Here's a new preset, assuming temp last is not checked.

Creative Ouroboros L2:

smoothing: 0.06 min_p: 0.1

LWDCLS Research org
β€’
edited Apr 25

I believe temperature last is the default and recommended sampler order for GGUF models loaded under KoboldCpp. As such, I believe optimizing for that scenario covers most people as a result. Usually it means the raw temperature value has a much lower impact.

I've been using text gen mostly; I haven't checked KoboldCpp. And I'm almost sure that temp last (in text gen) influences the model's overall behavior rather than the token probabilities themselves, to be exact.
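
For what it's worth, here's my rough mental model of what changing the sampler order does on paper, sketched with toy numbers and plain temperature standing in for smoothing. This is my own illustration, not any backend's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([8.0, 6.5, 6.0, 4.0, 2.0, 0.5])  # toy values
temp, min_p = 2.0, 0.1

# Temperature FIRST: the distribution is flattened before min_p looks at it,
# so the cutoff (min_p * new top prob) is lower and more of the tail survives.
flat = softmax(logits / temp)
temp_first = flat[flat >= min_p * flat.max()]

# Temperature LAST: min_p trims the original, sharper distribution first,
# and temperature only reshapes whatever survived that stricter trim.
base = softmax(logits)
kept = logits[base >= min_p * base.max()]
temp_last = softmax(kept / temp)

print("temp first keeps", len(temp_first), "tokens:",
      np.round(temp_first / temp_first.sum(), 3))
print("temp last keeps ", len(temp_last), "tokens:", np.round(temp_last, 3))
```

With these toy numbers, temp last leaves fewer tokens in play, which matches the deterministic feel described above.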

Update on Preset. Last one was too incoherent.

Creative Ouroboros L2 (temp last off): smoothing: 0.07 min_p: 0.075

Update on presets: the last one was also too incoherent. After testing on a 20b, I found two presets that are very, very similar in token probabilities yet have a noticeably different effect on gens.

C.O. A (temp last off): Temp: 0.88, min_p: 0.055, smoothing: 0.06 (this is as far as I can push it and keep gens coherent with temp last off)
C.O. B (temp last off): Temp: 0.95, min_p: 0.055, smoothing: 0.07 (this preset is more creative and has a tendency to skew words)

After some testing, I conclude that C.O. B is the better preset; C.O. A is quite deterministic in word choice. Also, C.O. B was really fun. It's almost like I'm watching a comedy, or reading writing that was somehow done by a person. Did someone hijack my LLM and start writing for me? Jokes aside, the C.O. B preset (temp last off) is quite creative and you should try it some time. The preset itself is barely holding its coherence together, but it works either way.

Quirks you may experience using C.O. B:

  1. Names may get skewed.
  2. Gens may or may not be incoherent; just refresh a few times.
  3. Might have wild exaggerations of personality.
  4. Hallucinations can happen rarely, though they're easily resolved by refreshing.
  5. Gens may be plain incoherent at the start; just refresh a few times and it'll settle in.
  6. Dunno, you tell me, as it probably varies from model to model.
  7. Good formatting is nil.

Testing done on 20b DarkForest v2

Responses:

Screenshot_2024-04-27_210656.jpg
Screenshot_2024-04-27_210916.jpg
Screenshot_2024-04-27_211957.jpg

LWDCLS Research org

Inconsistent formatting hurts my soul. Good job on testing.

Update on the preset. It turns out that turning off temp last in text gen made the model super wild in choosing tokens. I was fine with that until informational consistency broke, so I went back to testing, which took me some time. After testing, I was surprised that I had to use very deterministic settings to get it right, but I think I found the threshold that balances creativity with keeping information consistent. Temp last is practically a Pandora's box.

Pandora's Box (temp last off, text gen): Temp: 0.99, min_p: 0.02, smoothing factor: 0.1

I think this preset is stable enough compared to the previous one; formatting has definitely improved, and the only quirks I see are occasional misspellings and a very low chance of incoherent responses, which is a fair price to pay for some creativity. I also have to note that this preset was made using text gen, as I think text gen's temp last option makes it behave differently from other backends.

It turns out that text gen's loaders react differently to presets. I found this out when switching to llamacpp_hf. So here's the preset:

Pandora's box (last temp off, text gen only, sampler order: temp, smoothing, min_p):
exllamav2_hf: Temp: 0.99, min_p: 0.02, smoothing factor: 0.1
llamacpp_hf: Temp: 1 min_p: 0.015, smoothing factor: 0.11

Pretty sure there's a correlation of sorts; I only realized how similar these two presets are after I was done testing for a few hours. Anyway, the reason for the tight settings is that the lower the values I use, the more the gens become... complicated-esque. They trend toward using fancy words and then mashing those words together so it looks complicated but isn't grammatically sound.

After plenty of testing again, I can confidently say this preset is definitely better:

Pandora's box (last temp off, text gen only, sampler order: temp, smoothing, min_p):
Temp: 0.99, min_p: 0.05, smoothing factor: 0.09 (20b)

Edit: It turns out this preset is mostly optimized for 20b. I noticed this when testing a 14b, so it looks like I'll be working on a preset series based on it. It'll take a while to find a sweet spot for each parameter size... damn you, temp last.

@Clevyby

I don't think it's a good idea to set temp before smoothing and min_p; it reduces the coherence of models.

Are you still playing with samplers?

I am experimenting with cutting off tokens with TypicalP and MinP, then using Smoothing and Dynamic Temp.
I think this will give us more coherent results, and we still get some variety from sparsifying better tokens.

https://huggingface.co/Virt-io/SillyTavern-Presets/blob/main/Samplers/%5BTest-01%5DRoleplay.json

@Virt-io I know the presets themselves sound theoretically bad, but I've actually been making good progress so far, using artefact's tool as an aid. Here's the situation: I was playing with samplers a few weeks ago and decided to set smoothing to 0.01 for the heck of it. Theoretically the responses should have come out as gibberish, but to my surprise they turned out coherent. I did some poking around, and it turns out the reason was the 'temperature last' option. I also checked the model's token choices against the set probabilities (via ST's probability viewer), and it turns out the problem isn't with the samplers that change token probabilities; it's 'temperature last' itself.

The 'temperature last' option modifies the model's behavior in such a way that it always leans deterministic, which heavily negates the influence of temperature-like samplers no matter their value. So I decided to experiment with 'temperature last' off. It took me many hours to make a decent preset with temp applied first, and I did succeed at first. But it turns out that 'temperature last' has a distinct effect at every parameter size (7b, 20b), which means the preset I'd had success with lately was only optimized for 20b's. So I experimented further to find a preset with 'temperature last' off for a unique 14b (which only has 25 layers, for some reason), and I tried to emphasize token diversity while testing.

In the end, I found the ideal preset for the 14b to be (Temp: 0.97, Top_a: 0.01, smoothing factor: 0.21). This was the preset that finally locked in coherence despite 'temperature last' being off. While these optimized presets sacrifice a good portion of the low-probability tokens and are quite deterministic in nature, if I move even a notch lower or higher (mostly on temp or smoothing), gens become so wild that informational consistency, and coherence in general, breaks down. That's how wild temp-first is. That's also why I'm not worried about the presets being deterministic: temp-first is so wild that it stays creative even when tamed.

As of now, I'm working on the old optimized preset for 20b's to see if I can squeeze in some more tokens for diversity. After that, I'll work my way down through the lower parameter sizes until 7b. Overall, I suppose the only con of the upcoming Pandora's Box preset series may be a lessened diversity of tokens, but it can make up for that in creativity.

Now, about your usage of samplers: I'm not sure about Typical_p, or about using smoothing and dynamic temp together. Typical_p is basically top_p with extra steps, and top_p by nature only focuses on the top tokens, which is not good for diversity. Using smoothing and dynamic temp together is problematic; in my experience they usually result in garbled gens. Smoothing really depends on temp as a baseline (it would probably be viable if dynamic smoothing were a thing). Since you're cutting off tokens, I'd suggest these other samplers: Top_a and Tail Free Sampling. For cutting off tokens, any one of min_p, top_a, or TFS does a great job on its own, and using only one of them keeps things simple. For a better intuition, think of it with this analogy: if the token probability distribution were a slab of meat, the tools for cutting parts off it would be:

Top_a: a scalpel; min_p: a normal knife; Tail Free: a big, thick cleaver.
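
To make the scalpel/knife comparison concrete, here's a rough sketch with toy numbers (leaving TFS aside since it's more involved). I'm assuming the commonly cited Top_a rule, a cutoff of a times the square of the top probability, so treat the exact formula as an assumption:

```python
import numpy as np

probs = np.array([0.40, 0.20, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])  # toy values

def top_a_filter(probs, a):
    # Top_a (as I understand it): the cutoff scales with the SQUARE of the
    # top probability, so a confident top token trims aggressively while a
    # flat distribution barely trims at all -- the "scalpel".
    threshold = a * probs.max() ** 2
    kept = probs * (probs >= threshold)
    return kept / kept.sum()

def min_p_filter(probs, min_p):
    # min_p: the cutoff scales linearly with the top probability -- the "knife".
    kept = probs * (probs >= min_p * probs.max())
    return kept / kept.sum()

print(np.round(top_a_filter(probs, 0.2), 3))  # cutoff 0.2 * 0.40^2 = 0.032, six tokens survive
print(np.round(min_p_filter(probs, 0.2), 3))  # cutoff 0.2 * 0.40   = 0.080, five tokens survive
```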

@Clevyby

I will use only MinP then.

Can you share the file for your sampler preset?

The reason I am using Dynamic Temp is because of this.

2024-05-06T08:11:02,030006354-05:00.png
image.png

Can you elaborate on which sampler preset you're referring to? If it's the optimized preset for the 14b, you can simply input the values and turn off temp last; the only other difference is that I used the ChatML template for the 14b I was testing on. Do note that it's strictly for that unique 14b; I have yet to work on presets for 7-8b. As for dynamic temp, just lower smoothing instead; you can use values specified to two decimal places.

Yes, the 14B preset.

This is so confusing, they all do the same things in different ways.

Here it is. By the way, the 14b I was referring to is lynn a2.7B-rp-v1; you can check the GGUF quants here

Pandora's Box series (temp last off, sampler priority: temp, smoothing, min_p/top_a):

A: Temp: 0.95, Top_a: 0.36, smoothing factor: 0.09 (20b)

B: Temp: 0.97, Top_a: 0.01, smoothing factor: 0.21 ( Unique 14b)

C: Temp: 0.99, Top_a: 0.19 smoothing factor: 0.11 (13b)

D: Temp: 0.95, Top_a: 0.2, smoothing factor: 0.1 (10.7b)

E: Temp: 0.98 Top_a: 0.18 smoothing factor: 0.11 (9b)

F: Temp: 0.98 Top_a: 0.27 smoothing factor: 0.1 (8b)

G: Temp: 0.98, Top_a: 0.05, smoothing factor: 0.14 (7b)

Quite frankly, I'm not sure whether these presets really differ much from simply enabling temp last. I did personally test them and adjusted them to the point where strange incoherencies are mostly moot, and I tried to emphasize token diversity while testing; the only one with exceptionally less token diversity is the 14b. The reason I used Top_a is that it's more precise than min_p at cutting off tokens. I used the llamacpp_hf loader for testing, with the exception of A (exllamav2_hf). I'd also like some feedback on these.

Quick question for those interested: I've noticed that Top_K is quite nifty. Top_a, min_p, and TFS are good token-cutting tools, but Top_K is so precise that its ability to cut off tokens is more exact than theirs. You can easily see the effect of Top_K in artefact's tool. With that in mind, should Top_K be used exclusively, given its seemingly superior precision, or should it be used in conjunction with those other tools? I've seen one instance, while using Kobold Horde, where using Top_K exclusively with high temp resulted in incoherent replies, but I'm not sure about that. Will find out tomorrow.

So far Top_K seems decent enough for me to use exclusively in place of Top_a. With artefact's tool as a guide, it's easy to dial in. Besides, the other token-cutting tools tend to limit things to fewer than 50 tokens anyway when given a small value.

It's been a while. I checked in the other day and it turns out I was wrong about Top_K. While Top_K is a good alternative for keying in an exact token count (instead of fiddling with values on the other token cutters in the hope of hitting a specific probability), it literally forces the number of tokens being considered. Unlike the other token cutters, which are dynamic, Top_K keeps a fixed count: even if the question is common sense to answer, a high Top_K makes the model consider a lot of tokens that aren't viable answers. Top_K is just not as flexible as the other token-cutting tools. As a result, while gens may be good at first, the model can suddenly switch things up and forcibly insert a symbol it is still considering, even if its probability is 0.0000x and should be out of the question, and then it turns into a slippery slope of incoherence. So, back to Top_a.
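
A tiny sketch of what I mean about Top_K forcing a fixed count (toy numbers, my own illustration):

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep exactly the k most likely tokens, no matter how the mass is shaped.
    kept = np.zeros_like(probs)
    idx = np.argsort(probs)[::-1][:k]
    kept[idx] = probs[idx]
    return kept / kept.sum()

def min_p_filter(probs, min_p):
    # Adaptive cutoff: scales with the top token's probability.
    kept = probs * (probs >= min_p * probs.max())
    return kept / kept.sum()

# A "common sense" question: the model is nearly certain of the answer.
peaked = np.array([0.97, 0.02, 0.005, 0.003, 0.001, 0.001])

print(np.count_nonzero(top_k_filter(peaked, 5)))    # 5 -- Top_K still drags in the junk
print(np.count_nonzero(min_p_filter(peaked, 0.05))) # 1 -- min_p adapts and keeps only the sure thing
```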

Lately, if you're using the latest ST staging with text gen, you may notice a new parameter: the DRY repetition penalty. It's a new feature that sounds promising; as of now it's implemented in the dev branch of text gen. Haven't tried it out yet, will do later. What do you think of the new penalty filter?
