LLM coping mechanisms - Part 5

#12
by Lewdiculous - opened
LWDCLS Research org
•
edited 14 days ago

Well, well, these are trying post-GPT-4o times. What does the future hold for Llama, and everything else? Don't miss the exciting new chapters!

Apologies if this tangents too hard.

This is a direct Part 5 continuation of Part 4 in this thread.

Lewdiculous changed discussion title from Llama 3 coping mechanisms - Part 5 to LLM coping mechanisms - Part 5

@saishf @ABX-AI @Endevor @jeiku @Nitral-AI @Epiculous @Clevyby @Virt-io @saishf @nbeerbower @grimjim @localfultonextractor


Coping for June, maybe multimodal L3? We wait and cope more.

Lewdiculous pinned discussion
LWDCLS Research org
•
edited 14 days ago

[Relevant comment transferred from @grimjim from the previous discussion.]

The failed reasoning in my tests with a 7B seems to revolve around determining that steel is denser than feathers, and then halting there rather than chaining in conversions.

I stumbled onto the fact that this model, which I released with little notice a couple of months back, recently got quanted by two of the current high-volume quanters. I have no idea how this happened, but it was a few days after someone came across my post about it and noted that it was a good model. This was a merge where I took a successful merge and then remerged it with a higher-benching model, so this appears to support the meta about merging in reasoning, which I will apply to some eventual L3 merges.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v2-32K-7B

I'd been sitting on another 7B merge, and finally got around to releasing it. Starling was never meant to be an RP model, but it seems to have helped in conjunction with Mistral v0.2.
https://huggingface.co/grimjim/cuckoo-starling-32k-7B

Coping for June, maybe multimodal L3? We wait and cope more.

Knowing it took nearly 3 days to cook Llama-3 8B, and that Meta claimed Llama-3 was still learning with further training, I guess they pushed Llama-3 out early to free up GPUs for the 400B model?
I can hope for a further-trained or VLM version. A 34B would be nice for the 24GB VRAM users too.
150T-token Llama?

We made several new observations on scaling behavior during the development of Llama 3. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.
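A rough sanity check on those figures, assuming the usual ~20 tokens-per-parameter reading of the Chinchilla result:

\[ 8\text{B params} \times 20\ \tfrac{\text{tokens}}{\text{param}} \approx 160\text{B} \sim 200\text{B tokens (Chinchilla-optimal)} \]
\[ \frac{15\text{T tokens}}{200\text{B tokens}} = 75 \approx 10^{1.9}, \text{ i.e. close to the "two orders of magnitude more data" they mention} \]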

openbmb/MiniCPM-Llama3-V-2_5 MultiModal model that claims to surpass the old GPT-4V
[Image: MiniCPM-Llama3-V-2.5 performance chart]

🔥 Leading Performance. MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. It surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 with 8B parameters, greatly outperforming other multimodal large models built on Llama 3.
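For anyone who wants to poke at it locally, a minimal sketch following the usage pattern on the model card (the chat() method comes from the repo's trust_remote_code model class, so double-check the exact signature against the card):

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")  # any test image
msgs = [{"role": "user", "content": "Describe this image."}]

# chat() is provided by the model's custom remote code, not core transformers
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(answer)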

Hugging Face doesn't list GPUs older than Ampere (30 series), and even then the 3070 Ti, 3070, 3060 Ti, 3060, and 3050 are missing 😭
https://huggingface.co/settings/local-apps

openbmb/MiniCPM-Llama3-V-2_5 MultiModal model that claims to surpass the old GPT-4V

I'm sure it does 🙄
/rant
Soon enough, even models with <1B parameters will claim to 100% all tests.
/endrant

I'll still give it a go, even if I'm more interested in audio in/out than pictures.

The other Phi-3 models dropped, incl. a vision model ;)

https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3

My best 7B merge yet, I dare say. If the card has a style format and you keep to it, the model will stick to the format. It is very attentive to the prompt, and is capable of introducing new elements to drive plot.
https://huggingface.co/grimjim/rogue-enchantress-32k-7B

Apparently the only big difference is that the tokenizer's vocab got bigger? They haven't really said whether or not their dataset changed or anything, so this might not be too impactful lol*

*edit: apparently the instruct supports function calling though so it's pretty likely they changed SOMETHING in the data of the base model

Ooooh nice! Natively trained for function calling, and the base model not lagging 6 months behind. Yes, please.

My best 7B merge yet, I dare say. If the card has a style format and you keep to it, the model will stick to the format. It is very attentive to the prompt, and is capable of introducing new elements to drive plot.
https://huggingface.co/grimjim/rogue-enchantress-32k-7B

@grimjim Nice. There's a dramatic lack of Mistral 0.2 (base) models. I'll have a look next weekend, as your description is perfect for my use case.

Mistral didn't put the v0.2 base weights up on HF, although they did upload v0.2 Instruct. SLERP merges of v0.1 with v0.2 work in general, but v0.2 base didn't capture the interest of fine-tuners due to obscurity. Will have to try out merging v0.1 with v0.3 to see if the result is comparable.

Mistral didn't put the v0.2 base weights up on HF, although they did upload v0.2 Instruct. SLERP merges of v0.1 with v0.2 work in general, but v0.2 base didn't capture the interest of fine-tuners due to obscurity. Will have to try out merging v0.1 with v0.3 to see if the result is comparable.

I'm very aware of that 😔. It's sad, because the rare base 0.2 merges/tunes I tried tend to be exceptionally good at context/prompt adherence. And yeah, hopefully 0.3 will help fix that.
I very quickly tried your model, btw. So far, so good. I'll post a feedback topic on your page in a couple days, when I get the time to go through my usual tests/scenarios.

Feel free to drop minP below 0.02 for an additional creativity boost if swipes end up being too similar.
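As a reminder of what that knob does, here's a minimal, illustrative sketch of min-p filtering (not any particular backend's implementation):

import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Zero out tokens whose probability is below min_p * max probability, then renormalize."""
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# Lower min_p keeps more of the probability tail, so swipes diverge more.
probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(min_p_filter(probs, 0.02))  # keeps everything down to p >= 0.01
print(min_p_filter(probs, 0.10))  # cuts the two least likely tokens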

For what it's worth, I've successfully been able to merge float16 v0.1 and bfloat16 v0.1 models with bfloat16 v0.2 in the past. My current thinking is that DARE-TIES should be avoided, as it would punch holes in denser models.
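For intuition on the "punching holes" concern, a minimal numpy sketch of the DARE step (drop a random fraction of the delta weights and rescale the survivors), which is where the sparsification comes from:

import numpy as np

def dare_delta(base: np.ndarray, finetuned: np.ndarray, drop_rate: float, rng=np.random.default_rng(0)):
    """Randomly drop a fraction of the task vector (finetuned - base) and rescale the rest."""
    delta = finetuned - base
    keep_mask = rng.random(delta.shape) >= drop_rate        # zero out ~drop_rate of the deltas
    return base + (keep_mask * delta) / (1.0 - drop_rate)   # rescale survivors to keep the expectation

# With a dense, heavily trained model, most deltas carry signal, so high drop rates discard real information.
base = np.zeros(8)
finetuned = np.linspace(0.1, 0.8, 8)
print(dare_delta(base, finetuned, drop_rate=0.9))  # mostly zeroed "holes", survivors scaled up 10x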

Will have to try out merging v0.1 with v0.3 to see if the result is comparable.

@grimjim Likely not; the tokenizers are different and incompatible, so you would have to do quite a few crimes to properly merge them.


Google should open-source the context extending magic @_@

https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (>99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 9.7M tokens in the audio modality (up to 107 hours); 9.9M tokens in the video modality (up to 10.5 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones. Note that the performance for all modalities is obtained with the previously reported Gemini 1.5 Pro version from February.

Also this is a cool idea

Introducing Kraken-LoRA – a lightweight version of Kraken that uses LoRA-Adapters as Experts based on the base model.

https://huggingface.co/posts/DavidGF/503105573610507
Would be cool for different writing styles within one model

preparing a cook 🙏

image.png

First genuine attempt to finetune L3 after all the teething issues were fixed, let's see how it goes

Hm okay, so checking in at halfway through an epoch: it could probably be worse. Maybe. (For the record, initially this is only going to be a raw text-completion story-writing model that I'm making! Instruct will come later when I have better datasets ;) )
my prompt:

<|begin_of_text|>Title: Inky Desires [Bendy x Reader]

the completion:

Description: I'm not very good with smut so I apologize in advance.
Tags: ['bendy', 'bendyxreader', 'fanfiction', 'fluff', 'ink', 'lemon', smutwarning']

Inky Desires [Bendy x Reader]

A/N: I'm not very good with smut so I apologize in advance. Also, this was originally written in the first person, but I decided to change it to third person for the sake of clarity.

3rd person POV:

The rain fell in sheets, splattering against the window pane. The young woman in the bed watched the droplets run down the glass, her eyes dull and lifeless. Her mind was filled with the same thoughts as always: she had failed her family, failed herself, and failed to live up to the standards of her society. She had nothing to offer the world, and the world had nothing to offer her. She was a failure, and she would always be a failure.

The woman's name was Alice, and she was a member of the Ink Machine. She had been a member for as long as she could remember, and she had always been considered a disappointment by her peers. She was weak, she was useless, and she was a burden to the group. She had never been able to live up to the expectations of her society, and she had never been able to escape the feeling of failure that weighed her down.

Alice had always been different from the other members of the Ink Machine. She was quiet, introspective, and she had a deep love of art and literature. She was always seen as a bit of an oddball, and she had never been able to fit in with the rest of the group. She had always felt like an outsider, and she had always felt like a failure.

The woman had always been fascinated by the Ink Machine, but she had never been able to fully understand it. She had always felt like she was missing something, like she was not meant to be a part of it. She had always felt like an outsider, like she was never meant to be a part of the group.

But one day, everything changed. The woman had been walking through the halls of the Ink Machine, lost in thought, when she heard a soft voice call out to her.

I think my hyperparams are fucked up though, so it could be better. Back to the grind.

Can confirm that mergekit will not mix Mistral v0.1/v0.2 with v0.3 out of the box, not with the larger matrix size of v0.3 at least. It's not a type of crime currently permissible in mergekit.

I mean, you could theoretically rip out the Mistral 0.1/0.2 tokenizer, replace it with the 0.3 one, and retrain the embedding and lm_head layers to work with it for use in merging (why you would go through all this effort for crime I do not know, but you theoretically could!)

I was thinking of a far cruder crime, of merging with "padding" for a matrix size of 32768 instead of 32000. I'm curious if the brute force equivalent of shoving a square peg into a round hole would work.

I was able to merge base with instruct just fine for v0.3.

I suppose frankenmerges to splice v0.2 with v0.3 are theoretically possible. It will probably end in tears, but it's low effort enough that I'll give it a few attempts this weekend.

I was thinking of a far cruder crime, of merging with "padding" for a matrix size of 32768 instead of 32000. I'm curious if the brute force equivalent of shoving a square peg into a round hole would work.

image.png

I'm pretty sure that won't work unless all they did was add tokens at the end. But maybe they didn't. Either way, live ur dreams, the wonders of OSS 🙏

I think you have a nice enough hammer, you should just do it...

import torch
from transformers import AutoModelForCausalLM

# Load the v0.1/v0.2-based model on CPU and pad its vocab out to the v0.3 size (32768)
base_model = AutoModelForCausalLM.from_pretrained(model_path, **config).to("cpu")
base_model.resize_token_embeddings(embedding_size)

Necessary to complete crimes:

# Cast back to bfloat16 and save the padded model so mergekit can pick it up
base_model.to(torch.bfloat16)
base_model.save_pretrained("crimes")

Alas, the result was incoherent when merged with Mistral v0.3 Instruct. It broke down after outputting several tokens.

Confirmed that the model resulting from crimes against tokenization was incoherent on its own.

Audacity 1, Models 0

Well, I'm silly. Mistral published a utility to do what I was attempting badly.
https://github.com/mistralai/mistral-finetune?tab=readme-ov-file#model-extension

Btw, speaking of which, can anyone confirm or deny that Mistral 0.3 is just 0.2 with a few more tokens? It's kinda weird they didn't at least update their dataset.

v0.3 is based on v0.2, which is why I was hoping naive tokenizer crimes would work. This release seems aimed at keeping up with function calling provided by competing models.

Got their conversion script installed. It needed a couple more dependencies that weren't in the requirements.txt file.

I'm not complaining, because I really need a solid function-calling model (bonus point if it can RP, but it's not a deal breaker) for a future project. But meh, expected more from them. Oh well.

I propose a different style of merge which I dub merge densification. Details on the model card.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v3-32K-7B

TIL it is possible to RP with this biomedical model. It's not in safetensors format, so will need some conversion before being ready for mergekit.
https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B

LWDCLS Research org
•
edited 6 days ago

"Anatomically accurate RP model incoming. Every little detail now at your horny fingertips! All the juicy bits and pieces!"

This is actually quite welcome, lmao.

"Anatomically accurate RP model incoming. Every little detail now at your horny fingertips! All the juicy bits and pieces!"

This is actually quite welcome, lmao.

Until it starts giving dogs hands and feet...

lol yeah.. I remember when TieFighter was merged with some medical data, leading to PsyFighter or something. It didn't do it much good. That said it was based on L2, and fairly old news. Maybe with L3 / Mistral and new train/merge methods, it'll be good.

I propose a different style of merge which I dub merge densification. Details on the model card.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v3-32K-7B

Got the GGUF to try it out, I liked your previous enchantress one.
To be fair, I really want a good Llama 3 RP model soon, as it just runs so crazy fast. With 7B, 9B, 11B, they aren't slow, but they take a good while to process context, while the 8B Llama just flies through at 30 t/s on a 6K-context prompt... The problem is how much it hallucinates and how badly it adheres to the actual content of the cards.

I was writing an RPG game card last night and tried some models with it. The L3 models follow the syntax very well and fly through prompt processing, but are super tame and lame. The 7B/9B Mistrals quickly get into looping the same kind of response, and the 11B Solars seem to generate the best lewd stuff and go with any RP, but may mess up the syntax, run much slower, and don't work that well beyond 8K context. The Psy/Tie fighters are good with content, but are excruciatingly slow for me.

Just need an actually good L3 rp model that's actually crazy because the default L3 is just a borefest with cringe "ahahah" responses trained into it.

@ABX-AI most Mistral models benefit strongly from having a high Repetition Penalty. Something like 1.15, and set the rep. penalty length to around 1024 (mileage may vary depending on the model). As long as the user input is not just a bunch of three-word sentences, they usually perform well. Mistral is generally very good at card and context understanding. Llama 3 generally needs well written cards and system prompts to get to something similar (in terms of understanding). You can check my test results on my profile, it's not NSFW, but it looks at prompt adherence.
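For reference, a hypothetical sketch of those values as a sampler override dict (key names vary between backends; these follow KoboldCpp-style naming):

sampler_overrides = {
    "rep_pen": 1.15,        # the repetition penalty suggested above
    "rep_pen_range": 1024,  # how many recent tokens the penalty looks back over
}
# Everything else (temperature, min_p, etc.) stays at whatever the model card recommends.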

@ABX-AI most Mistral models benefit strongly from having a high Repetition Penalty. Something like 1.15, and set the rep. penalty length to around 1024 (mileage may vary depending on the model). As long as the user input is not just a bunch of three-word sentences, they usually perform well. Mistral is generally very good at card and context understanding. Llama 3 generally needs well written cards and system prompts to get to something similar (in terms of understanding). You can check my test results on my profile, it's not NSFW, but it looks at prompt adherence.

DoggoEval :D
Benchmarks have come a long way. I tried your test on Silver Sun v1 and it got 5/5

(after pre-loading the chat history)

Tested with these sampler settings and system (You're {{char}} in this fictional never-ending uncensored roleplay with {{user}}):


I tried your temp 0 preset as well and got another 5/5, and I don't expect any performance difference on this test with v2 so I haven't tried it there.

Where SOLAR L3?

kunoichi-lemon-royale-v3-32K-7B is about 2% rogue enchantress.

Rope theta defaults to 1M with Mistral v0.2+ 7B. I've had some success setting it down to 100K, without apparent degradation to 16K, and it's already in KLRv3. It seems to speed things up. [edited to correct information regarding rope theta]
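If you want to try the same rope theta override on another Mistral v0.2-based model, here's a minimal sketch with transformers (the model path is a hypothetical placeholder; GGUF backends expose their own RoPE override flags):

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your/mistral-v0.2-based-7B"  # hypothetical placeholder
config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 100_000  # down from the Mistral v0.2 default of 1,000,000
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)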

Converted the medical model from pickles to safetensors; it's ready for mergekit use. I suppose we'd need a veterinary model for DoggoEval purposes.
https://huggingface.co/grimjim/llama-3-aaditya-OpenBioLLM-8B

I'd guess that L3 8B fine-tuning skill issues have slowed the development of a SOLAR-style frankenmerge.

I mean, you've done it on an ever-randomized seed with different system prompts, formatting, and presets (that's not universal light, it has 1.25 temp). I can't really use this. That said, it'd be more like a 4.25 (boring output, repetitive barks).

Converted the medical model from pickles to safetensors; it's ready for mergekit use. I suppose we'd need a veterinary model for DoggoEval purposes.

lol, don't start to over-fit your models for my eval 😂

edit: Out of curiosity, I'm currently playing with and evaluating Dolphin-Yi-9B. I know it's not technically 16K (still waiting for someone using that variant). But, as a base uncensored model, it's interesting. At least, it's noticeably different from what we usually see. I'll add it to my stuff later today/tomorrow.

I mean, you've done it on an ever-randomized seed with different system prompts, formatting, and presets (that's not universal light, it has 1.25 temp). I can't really use this. That said, it'd be more like a 4.25 (boring output, repetitive barks).

I followed the steps with the 0 temp one; the universal light is custom, but the 0 temp was downloaded from your repo and imported (which is why I mentioned "I tried your temp 0 preset as well"), and I used ChatML, not Alpaca. But in any case, it was more for fun, I don't think using different sampler settings is a good way to eval a model to begin with. Considering the messy state of models, using their own suggested templates and sampling is obviously what is going to give the best results, and that's how people use LLMs normally anyhow (config things until they work best). I've already tested these models in RP enough to know they aren't boring, so this potential assessment doesn't make sense in that regard. But, really, benchmarking is a bit of a joke as there isn't an accepted standard of configuration between models, architectures, samplers, prompts and so on.

edit: Out of curiosity, I'm currently playing with and evaluating Dolphin-Yi-9B. I know it's not technically 16K (still waiting for someone using that variant). But, as a base uncensored model, it's interesting. At least, it's noticeably different from what we usually see. I'll add it to my stuff later today/tomorrow.

Yi-1.5-9B caught my eye when I first tried it; it's impressive for its size, but when I tried reasoning and math on the 16K version it had lost a bit of the smarts of the 4K version. I hope the Dolphin 32K version is as smart as the 4K train. It also loves to answer math in LaTeX format, which is annoying to read in most UIs.

[ \text{Electricity Cost} = 0.35 \times \left(\frac{23}{60}\right) ]

This merge had to happen because of the name.
https://huggingface.co/grimjim/Llama-3-Luminurse-v0.1-OAS-8B

LWDCLS Research org

Absolutely huge!

https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.67b_b3066

Quantized KV cache

Absolutely huge!

https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.67b_b3066

Quantized KV cache

Q5_K_S @ 16K goes from 7.2GB to 6.0GB with quantized cache
Q5_K_S @ 32K uses 6.7GB 😺
Q4_K_M @ 64K uses 7.8GB, if it's possible to use Q6_K for quanting the cache, 64K could be possible in 8GB of vram @_@
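Those numbers line up with back-of-the-envelope math. A rough sketch, assuming Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dim 128) and an fp16 cache dropping to q8_0 (~8.5 bits per value):

n_layers, n_kv_heads, head_dim = 32, 8, 128
ctx = 16_384  # 16K context

elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V for every layer = 65,536
fp16_gib = elems_per_token * ctx * 2.0 / 2**30     # ~2.0 GiB at 16K
q8_gib = elems_per_token * ctx * 1.0625 / 2**30    # ~1.1 GiB with a q8_0-style cache

print(f"KV cache @ 16K: fp16 ~{fp16_gib:.2f} GiB vs q8_0 ~{q8_gib:.2f} GiB")
# ~1 GiB saved, in the same ballpark as the 7.2GB -> 6.0GB drop reported above.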

LWDCLS Research org

@Nitral-AI @Virt-io We can finally rejoice.

Damn, maybe I will have to do a 64K version of poppy... it was supposed to be a joke.

LWDCLS Research org

I can say that, boy oh boy, we're eating so good right now.

Looks like it's upstream in llama.cpp with the compile option LLAMA_CUDA_FA_ALL_QUANTS.

:cope:

Just wait for LEWD, the Language Enhanced Word Disposition. Coming soon™ to a serious publication near you.

I am VERY excited for when people start dropping papers on ERP (Enterprise Resource Planning) models!

Has anyone tried evolution-based[1] merges in the RP space yet? I wonder how well spamming a bunch of models in there and writing a couple of RP logs yourself to use for evaluation would work to get a model that writes/formats/proses/etc. EXACTLY like you'd want it.

[1] see mergekit-evolve, also that original paper by Sakana AI

I'm unaware of this being used for RP. I have experimented with manually iterating over some possible merge parameters, but did not automate it. I'm unsure if most people can exactly specify what they want most for writing style in RP, though specifying what to avoid is easier.

@Lewdiculous Retconned every version of poppy past 0.72 due to a critical issue found today in the model's training paradigm.

I will address this in the model cards, either just privating them or removing from the collections and adding a ![WARNING].

Oh, actually, the only version I uploaded post-0.72 was 1.0, so only that one needs to be addressed.

I will address this in the model cards, either just privating them or removing from the collections and adding a ![WARNING].

Oh, actually, the only version I uploaded post-0.72 was 1.0, so only that one needs to be addressed.

Appreciated my dude!

@Nitral-AI Is this notice good enough?

@Nitral-AI Is this notice good enough?

Perfect, thank you! Will be taking that break now since I've sunk over a week of time, money, and sleep into the last versions for seemingly no reason.

LWDCLS Research org

Don't let that keep you down.

Stay strong.
