[SOLVED] Temp folder littering

#8
by ABX-AI - opened

Could this script be littering Temp with new cublas (and other reqs.) versions on every run? I had ~300GB in there last time I checked after going wild with quantization 😅
Correct me if I'm wrong, it could actually be some other script or app doing that (so my bad if that's the case), but my C drive tanked soon after I started merging, so it may be related. If you have any ideas what may be doing it if not this, let me know <3

Ah, so it's the huggingface_hub cache causing it. It caches all downloaded models to the C:\Users\{{User}}\.cache\huggingface\hub folder on the C drive. This can be useful or a hindrance: if you try to convert the same model again later and it's still in the cache, it won't have to be downloaded again, it will just be symlinked and used for quantization. But it can eat disk space quickly if you're churning through model after model without checking it. On my end I have a manual PowerShell command, rmdir "C:\Users\{{User}}\.cache\huggingface\hub", aliased in my Terminal to "rmhfcache", which I run after I'm done with conversions for the day to clear that up...
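
For reference, a minimal sketch of what that cleanup could look like as a $PROFILE function (the Clear-HfCache name is just an example, and the path assumes the default cache location):

```powershell
# Delete the Hugging Face Hub download cache to free disk space.
# Anything removed will simply be re-downloaded the next time it's needed.
function Clear-HfCache {
    $hubCache = Join-Path $env:USERPROFILE '.cache\huggingface\hub'
    if (Test-Path $hubCache) {
        Remove-Item -Recurse -Force $hubCache
    }
}
Set-Alias rmhfcache Clear-HfCache
```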

From their documentation:
https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup

I suppose you can add something like $env:HF_HOME = 'YOUR_NEW_HF_PATH' to your Terminal $Profile file and it will be set upon its launch.
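
As a sketch, that profile line plus a persistent user-level variant could look like this (D:\hf-cache is just a placeholder path):

```powershell
# In $PROFILE: redirect the Hugging Face cache for every new session.
$env:HF_HOME = 'D:\hf-cache'

# Or set it persistently for the current user (applies to newly opened terminals):
[Environment]::SetEnvironmentVariable('HF_HOME', 'D:\hf-cache', 'User')
```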


Added relevant notice to ReadMe.

That's another one; I noticed it before writing the thread (and I clean it up), but there is something else happening with cublas, and IDK if it's this script doing it or something else. This is 105GB of cublas DLLs:

image.png

I think it could be... koboldcpp doing this on every model launch?! A bit crazy, but it may not be this script, as the script only downloads changes to llama.cpp. It could also be mergekit somehow doing it... I'm not sure, but it's crazy littering xd

And these are some actual hub cache models, another category

image.png

The CUDA DLLs are checked against the bin folder and their download is skipped if they are already there; basically, they should only be downloaded the first time you use the script. Interesting observation with the temp folder.
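
Roughly, the skip-if-present logic amounts to something like this (a sketch only; the file name and URL variable are placeholders, not the script's actual code):

```powershell
# Only download the cuBLAS DLL if it isn't already in .\bin
$dllPath = Join-Path $PSScriptRoot 'bin\cublas64_12.dll'   # placeholder file name
if (-not (Test-Path $dllPath)) {
    Invoke-WebRequest -Uri $cudaDllUrl -OutFile $dllPath   # $cudaDllUrl is a placeholder
}
```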

Okay, so are these the files you're seeing?

image.png

image.png

This really seems like a KCPP thing, not related to model quants. It might be a good idea to raise an issue about it (my screenshot should be good enough as evidence) and link it here so we can boost it.

Running a model now gave me another one already:

image.png

I think you are right, I also observed kobold-related naming in some of these folders, and YES, they are named exactly like you showed. Thanks a lot for the support on investigating this, I will raise it in their discord <3

Opened an issue: https://github.com/LostRuins/koboldcpp/issues/768

It's fixed! I got 30GB of storage back :3
And maybe now my bandwidth usage will fall? I hope so, at least

WE DID IT, REDDIT!!! :d

Just noticed it when concedo returned and made the new build :) Nexesenex also has that fix in his latest forks.

Now, for step 2: beg for Ampere builds with cuBLAS 12.3 to be added to the releases, because it's so much faster on rtx 3000 :S
In some cases I get massive output and overall t/s improvements with Nexesenex's 1.59 fork with cuBLAS 12.3

1.62.1 is now defaulting to using mmq for me, which causes outputs to run at ~13 t/s
1.61.2 outputs at 20 t/s, because it isn't defaulting to using mmq

  • Model is Fimbulvetr v2 fully offloaded into vram

Also Nexesenex's fork doesn't help with turing, at least not for my use case.
Couldn't get 30 series at the time i bought a gpu, no stock :')

This is not what's happening on my end, because the Ampere exe I am using is ALSO defaulting to mmq 😅 I do the A/B tests with it enabled and still see the improvement from the cuBLAS 12.3 fork. I'm seeing actually faster output generation, which should not be related to mmq (mmq is about prompt processing, not generation)

PS: I cannot even get my models to start processing without MMQ. Disabling it completely breaks it for me, not sure if it's related to the RTX 3070, but that's my reality :/ The speed improvement on 12.3 is considerable for me to the point that I just use only GGUFs I trust with the outdated 1.59 fork with that cublas version.

That's why I'm confused, I'm unsure as to why mmq is affecting generation speed in both the official and the fork exe by a huge amount
I'm going to try installing the latest cuda I can, hopefully stuff doesn't break like last time :3

Check it out, it's basically double speed o_O
Same quant, both fully into GPU. Same prompt and char card. On 12.4 I may get a fast speed at first, but then it goes into SLOWMODE. On 12.3 it's consistently fast. Slower with bigger context, but this is just an A/B test here.

image.png

I did testing kinda?
Screenshot 2024-04-11 023108.png
I do wonder if it's because I'm using an IQ3_M quant, because I can't find anything else wrong?
I'm on the latest cuda now, with a 2080 using Fimbulvetr, 8K context in vram
Completely lost :>

mmq is an upstream feature that can be added to --usecublas to use quantized matrix multiplication in CUDA during prompt processing, instead of using cuBLAS for matrix multiplication. Experimentally this uses slightly less memory, and is slightly faster for Q4_0 but slower for K-quants. However, for long prompts on new GPUs, cuBLAS is generally faster at the cost of slightly more VRAM (MMQ off).

This is the description for MMQ in the koboldcpp wiki, but if anything I'm now more confused at how the difference can be so big from a prompt-processing change 😿
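
For anyone wanting to repeat the A/B test, the toggle is just whether mmq is appended to --usecublas at launch; something like this (model path, layer count and context size are placeholders; flag names per the koboldcpp --help/wiki):

```powershell
# MMQ on: quantized matrix multiplication for prompt processing
.\koboldcpp.exe --model .\Fimbulvetr-11B-v2.Q5_K_M.gguf --usecublas mmq --gpulayers 99 --contextsize 8192

# MMQ off: plain cuBLAS for prompt processing
.\koboldcpp.exe --model .\Fimbulvetr-11B-v2.Q5_K_M.gguf --usecublas --gpulayers 99 --contextsize 8192
```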

IQ quants are weird, and I find them illogical in some tests.
For example, I can run and start outputting with 16k context on a Q4_K_M quant of an 11B. The same model, with 16k context on IQ3_XXS, will not start outputting at all. It lags my whole system and VRAM overflows. I'm not sure why, shouldn't it take less space to do that vs Q4_K_M, which starts outputting and does not overflow? The fix is to not offload the entire IQ3_XXS quant. Once I loaded it partially to GPU, it was outputting again.

Also, not sure which Nexesenex build you tried, but the one with the Ampere fork is this: https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.59c_b2249
The following ones don't have this driver variant, I'm pretty sure. Furthermore, you will likely not see any difference like I do, because you are not on Ampere (rtx 3000) but Turing (rtx 2000). So I wouldn't suggest it for you to begin with.

For MMQ... can't say, but the information itself states speed changes may vary. As you see, for me, disabling it won't even start outputting at all :D For you, disabling it is an improvement. Maybe because I use K-quants? Haven't tried turning it off on IQ or non-k tbh.

Oh yes. It's definitely because of the K-quant. I tried mmq OFF on IQ3_XXS and it flies.

image.png
prompt processing goes brrrr o_o

The prompt processing speed is how I could tell something was wrong
Screenshot 2024-04-11 025337.png
Both Q5_K_M Hermes Pro 2 as a test
Screenshot 2024-04-11 025511.png
Nearly halved my 7b performance, MMQ be beeg hungry for turing gpus :3
Edit! MMQ is automatically turned off when selecting Lemonade RP IQ4_XS, but when selecting Fimbulvetr IQ3_M it keeps MMQ.

Try it off with a non-K quant. Some IQ or normal Q, looks like the K-quants may be the problem. All these quants with their own quirks lmao

Q5_0 Hermes Pro 2 MMQ
Screenshot 2024-04-11 030705.png

When it thinks the model is too large and automatically selects fewer layers than the max, it enables MMQ

Screenshot 2024-04-11 031203.png
Screenshot 2024-04-11 031211.png
I'll mess around with it more tomorrow cause it's kinda fun, but it's 3am so i should sleep :3

@saishf thanks for noting this, I hadn't noticed. I swap models so often when testing merges or settings that at one point it just all blends in. Generally, there are too many variables... And then you have sampler settings and sampler order, which can also affect performance severely, eg wrong sampler order slows me down quite a bit.

BTW, in the kobold discord I see people recommending to keep MMQ off very often (although it may be also based on what GPU). Don't overthink it, run whatever is the fastest and enjoy ^^

I just want my ampere exe but at the same time, I'm eventually going to get a better card and hopefully not have to min-max so hard. Also, 1bit era when? Llama 3 is coming next week (small models first), and this week we got: (copy pasting from reddit)

StableLM_2-12B: https://huggingface.co/stabilityai/stablelm-2-12b

CodeGemma-7B: https://huggingface.co/collections/google/codegemma-release-66152ac7b683e2667abdee11

Recurrent Gemma: https://huggingface.co/collections/google/recurrentgemma-release-66152cbdd2d6619cb1665b7a
-> Check this tweet for more info https://twitter.com/IParraMartin/status/1777721730959593976

Mixtral 8x22B with 64K context window
-> Check this tweet for more information: https://twitter.com/danielhanchen/status/1777912653580771674

GPT4 Turbo with Vision

Gemini 1.5 Pro with Audio

Can't wait for llama 3 as well.

GUI is bloat.

Heya fellas, happy to see it solved!

I feel 1bit will be most important with moe models. Being able to fit big moe models into small amounts of vram would be awesome, because the ram required is so big yet the compute is so little.

CLI is hard for me because I'm usually remoting into my PC with my phone, so typing is rather hard; I use the phone keyboard and it covers half of the screen. Whereas I can use the GUI without much effort.
If the CLI was pretty and had autocorrect, then I'd understand it :3

@saishf

My Terminal is pretty. Umpf!

Maybe this can help.

You need PSReadLine for PowerShell.

I just type "kobold" and swipe through the history. If it's a model you already ran you can just type a part of its name and hit enter to use the previous command, takes only a couple seconds that way.

Also look into Zoxide – there are some good YouTube videos about it – to replace your cd command; being able to do a cd models rp fav from anywhere to get where I need to go, for example, is great.

You can also just make PowerShell scripts if your use case involves the same models frequently and use Set-Alias to get convenient commands for each. I have a few easy to type commands to quickly launch everything I need like that.

If you're gonna use KoboldCpp from the CLI, make sure you're installing it with Scoop.sh, so you get the koboldcpp command available globally and updates are handled with a simple scoop update --all.
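
As an illustration of the Set-Alias approach (all names and paths here are made up, and it assumes the scoop-installed koboldcpp command is on PATH):

```powershell
# In $PROFILE: one short command per go-to setup.
function Start-Fimbul {
    & koboldcpp --model 'D:\models\rp\Fimbulvetr-11B-v2.Q5_K_M.gguf' --usecublas --gpulayers 50 --contextsize 8192
}
Set-Alias fimbul Start-Fimbul
```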

What's the point with Kobold, though, considering the GUI only lasts until the terminal loads? There is no GUI taking up vram while it runs

Because it makes you look cooler! Plus it makes it easier to use llamacpp if you ever need to, in the scenario where a custom llamacpp fork supports a model koboldcpp doesn't yet.
It happened with Command R.
Plus, one thing I've learnt is that fluency with the CLI is super helpful sometimes, specifically when formatting drives in weird formats.

Thank you! I did hate cd-ing into everything, especially cause my files are across different drives.
I've also seen cheat on github which seemed interesting but it's not officially supported on windows
And I figured out how to change my CMD colours to whatever I want thanks to the new Windows Terminal thing, the Win 11 one.

Can't argue with that. I just get enough CLI usage at my work, and it would legit take me far longer to launch kobold via the CLI when testing settings and swapping model after model ^^ but if you have a couple of main setups, yeah, that would be a cleaner way to launch it :P

BTW, Nexesenex helped me out with making a new Ampere build but the difference is still not as big as that old 1.59 build he did

321795767-471d64d7-0d52-47a2-9b40-c61cccf84c42.png

Super weird, it's not clear what makes that x2 speed on the old build with certain quants... The conclusion was for me to update the video drivers and see if it's still slower with the new build.

Are you sure Q4_K_M isn't loading into VRAM swap?
IQ4_XS I get 20 T/s on, on an older GPU with similar fp16 performance, but you have the new cublas gains.
The quant doesn't seem to change performance that much for me.
33% performance difference for a ~0.4GB file size difference

Also beeg cpu, quanting must be fast!

Not sure how fast quanting "should" be; it takes me a while, but then again for a model like an 11B it comes out to ~100GB of new files ^^ I'm using cuda with the script as suggested, with 13 layers or so (more = overflow with 7b+ sizes).
As for the speed... I'm really not sure at this point. I do get different performance based on model and quant, they are not all the same in performance in my experience. With this old build I have seen x2 on some quants and not on others (and the example above is pretty much x2 speed)

The differences between quants get so confusing 😭
Like Q8_0 is so much faster than Q4_K_M but near twice the size when quanting
And mmq is slower with _M but not _0 (wiki).
Confuzzled 💫
Edit - forgot some words

I'm wondering if my case is related to the GGUF quant being reported as llama and that making it faster, since the solars are all llama-based at the arch level, and then I see no difference with some 9B mistrals. All of these were tested fully offloaded to the GPU as well. Eh, what can I say, open LLM is as fun as it is confusing :D

Secrets. Solar uses mistral weights (it's probably why it performs so well) :3

We present a methodology for scaling LLMs called depth up-scaling (DUS), which encompasses architectural modifications and continued pretraining. In other words, we integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.

They just used the config for llama due to the lack of sliding window from what I understand.
I believe you could replace "LlamaForCausalLM" with "MistralForCausalLM" and have llamacpp report it as a mistral model
Edit - Remove breaky stuff

Nah, not the case. Check their paper ^^

image.png

It's not even a secret at all, the SOLAR paper explains quite well what they did: the weights are Mistral, but the base arch is LLAMA. The config needs to be LlamaForCausalLM, or it will result in potential issues (people complained about my first solar merge because it was acting weird due to this issue). There is some issue with stupid output if you use the mistral config. However, it's unfair to call it a straight-up Llama model, the same way it's unfair to call it a straight-up Mistral model. The arch is ultimately llama at the very bottom (which is compatible with mistral).

Full paper on solar upscaling: https://arxiv.org/pdf/2312.15166.pdf

(oh, and of course, they added 3B of training in the newly created layers which gave it the improved performance, and tbh the uncensored solar is the most unhinged rp model I have found, I've mostly just been using the silver-sun model. PS, @saishf I also have v2 on that which uses your kuro-lotus-fimb but with fimb v2, however the difference is minor and I feel like the v1 is a bit faster for me so I go with it)

Are llama1 and llama2 the same arch?
Solar's config uses model_type: llama, how does llamacpp tell the difference between 1 and 2?
Is llama2 just further training over 1?
And does that mean if llama3 is a new arch it'll break llamacpp and koboldcpp 😭

I'm pretty sure Llama 1 and 2 use the same architecture (from what I have read here and there), or with minor differences. I sincerely hope llama 3 won't break anything, but the devs are so fast, that I doubt it would be an issue. My fellow Bulgarian ggerganov does like 5 commits per minute anyhow :D

Nexesenex does early builds too if Lost Ruins lags behind. He even made this Ampere 12.3 build for me <3

The dev community will make it work ASAP, no worries, at least that's how I feel about it, haha

Did you see how fast the new Mixtral got fine tunes? Less than a day and we have 2, one official from HF. Damn, this community is on SPEED

I tried the silver-sun v2 for a little and it was spicy; characters that don't usually initiate stuff with fimbu v2 did so nearly instantly, which is perfect for characters that want to commit crimes :3

I haven't put much time into testing the original fimbs v1 vs v2, but with both I'm much happier than with the 7b>9b stuff I was doing. It's less prosaic and way more hardcore, and I have some pretty f-d up cards that I use. Wish I could share them since they ended up getting 1.5k+ downloads for very niche fetishes, but I don't want any connection with that account, hahaha. The solars go hard, they don't ask me "are you ready" and don't repeat themselves as much in my experience. And when going from Q4 to Q6, starting with 4k+ context, I get glorious results, the writing is just good.

Last time, I ended up entering an out of character talk with one of my unhinged cards just to tell it the rp was really good and to talk about whether the model would want to change something. Gave me some decent comments which I was already considering and said it really liked exploring such twisted scenarios xd

If the rest of the world moved as fast as the devs here, we'd have flying cars and cyborgs

I'm most interested to see how the 8X22B performs on lmsys' chatbot arena. I'd like to see how it compares to models like Grok and Command-R Plus

I was kind of impressed the new mixtral went about 20 points higher in math (gsm8k). Some people glossed over that but it must be smarter, or I guess trained on that benchmark... But still, should eventually result in more smarts at least a bit. It's just... out of my use case. GPU poor for now

I'd love to see a solar based on the llama 3 arch; however they trained those models just made them really good for general use and really receptive to training for rp models. Plus, if they could up the base context they'd be perfect imo.
I really hope that llama 3 has a high base context too. I like having higher context so I can have a higher-token character card, 3-7k tokens per card, because I fill them with every detail the way I like it while never removing much.

I agree, it would be pretty sad if they released it with 4k again. It should be 8k or beyond, hopefully. I'm also hoping for some improvement, I wish for something like these rp solar merges but with even more attention to little details and even weak traits in cards. Overall, just want better interaction, anything closer to a human brain. And really, hoping to get better performance overall in the smaller models between 7b and 13b, because the big ones and new hits like command r+ are just far too big for any casual user. You'd need a few 24 gig cards to load viable quants q_q

People are currently working on smooshing and splitting the experts into a mini, non-MoE Mistral 22B and say it keeps the math smarts. I don't know how it works though
Vezora/Mistral-22B-v0.1
thomasgauthier

I'm pretty sure they should release small, smarter llamas; they mentioned wanting to put llama 3 into sunglasses as an assistant, so probably RAG and multimodal support too

Need to upgrade my sunglasses from gtx 760 to RTX 😂

That's a great idea, though, as google glass wasn't working out too well. These days with LLM, we could make nerding great again

@saishf

About the Terminal thing:

Get onboard with OhMyPosh (Windows Setup) - https://ohmyposh.dev/

List of premade themes:
https://ohmyposh.dev/docs/themes

For Windows Terminal customization it's pretty good, and you can edit the prebuilt styles/colors in their respective .json.
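
A minimal sketch of that setup on Windows, going by the Oh My Posh docs (the theme file name is just an example; check their docs for the current commands):

```powershell
# Install Oh My Posh via winget:
winget install JanDeDobbeleer.OhMyPosh -s winget

# Then in $PROFILE, initialize it with one of the premade themes:
oh-my-posh init pwsh --config "$env:POSH_THEMES_PATH\jandedobbeleer.omp.json" | Invoke-Expression
```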

I made my terminal pink :3
Screenshot_20240413-184422~2.png
I think I can copy the colours into the posh terminal

Can confirm nous hermes 2 8x7b can fix my broken json that changes the font for posh terminal :>

@saishf

I like just having a GIF in the background with the normal blur enabled with Wallpaper Engine in the Desktop.

Abolish the Command Prompt, embrace PowerShell 7+.

The new default terminal in win 10/11 is pretty sick, the fact you can put animated videos/gifs as background without any extra tooling is also pretty sick, although that's a bit too much for me ^^

I keep it simple, but have tried posh and it's very nice if you want to go the extra mile for customization.

image.png

I always had issues with port 5001. I had to use port 6969. That one is flawless for me.

That's my discord #ID lmao. As for 5001, it has been fine for me

I love wallpaper engine but figured it would cost performance when it's running in the background, as I'm only running my igpu for my displays.
It would also cause higher idle power I assume?
Power is expensive 😿
Edit - I reinstalled wallpaper engine, it uses 30-50% of my igpu on its own. + 30w idle power

@saishf Power wise it's the same as watching a YouTube video, so you can think of that in your own pricing. I didn't notice an increase, but I also only run it when no window is maximized, so if I'm doing anything in full-screen it pauses; I can just fullscreen my terminal to stop any unwanted power usage if needed.

It does take 100MB of VRAM, but if the wallpaper is a video file like mine, it will offload the work to a different part of the GPU (Video Decode) than the parts you use for inference, so it's not "that bad".

But I did test.

1: KCPP --benchmark with Wallpaper Engine running @1440p:

ProcessingTime: 11.28s
ProcessingSpeed: 362.67T/s
GenerationTime: 5.27s
GenerationSpeed: 18.97T/s
TotalTime: 16.55s

ProcessingTime: 11.30s
ProcessingSpeed: 362.09T/s
GenerationTime: 5.30s
GenerationSpeed: 18.88T/s
TotalTime: 16.60s

2: KCPP --benchmark with Wallpaper Engine paused/stopped @1440p:

ProcessingTime: 10.82s
ProcessingSpeed: 378.08T/s
GenerationTime: 5.07s
GenerationSpeed: 19.73T/s
TotalTime: 15.89s

ProcessingTime: 10.82s
ProcessingSpeed: 378.36T/s
GenerationTime: 5.06s
GenerationSpeed: 19.76T/s
TotalTime: 15.88s

So technically there was a 4.5% loss.

If you let the iGPU handle WPPEngine though it should be pretty negligible.

Wait, didn't you say you use the terminal to avoid bloat? :D I will be needing those juicy 100mb vram nom nom

The wallpaper I like, a city skyline kind of thing? Bullies my igpu

Screenshot_20240414-030019.png
Screenshot_20240414-030044.png
With and without wallpaper engine running, I run my terminal windows like this for easy model changes and vram monitoring from my phone
Inference is handled by the dedicated GPU so speed isn't much of a worry for me :3

With VRAM, something I've found is that Nexesenex's koboldcpp fork uses 0.2GB less VRAM than the original exe; it allows for 8k context without going into memory swap like the original does.

https://github.com/Nexesenex/kobold.cpp/pull/90#issuecomment-2051834388

He helped me investigate something like that, you can read about it in that discussion, but ultimately I can't replicate it with newer builds.

I have my priorities well defined, as you can see a man can't compromise on everything.

The dedication of these devs is insane; it must be so much harder to diagnose stuff now that there are like 20-30 different quant types in total, each with their own versions 😭

I believe some consolidation of quant options is planned in llama.cpp, to bring the number of quants down and help users choose the best-performing options more easily... Eventually. Surely. Soon™.

If llama3's release date lands according to plan, llamacpp devs will probably be busy building compatibility 😭

"Within the next month, actually less, hopefully in a very short period of time, we hope to start rolling out our new suite of next-generation foundation models, Llama 3,”

A week ago @_@

Edit - source

4 days left, or riot

I sure hope they aren't using Emu image generation for meta-ai on Instagram, it's uh... bad 😭
Prompted with "1970's living room"
IMG_20240415_040321_550.jpg
Meta-AI
00112-3172016064.png
JuggernautXL (SDXL)

Marking as solved. Have fun!

FantasiaFoundry changed discussion status to closed
FantasiaFoundry changed discussion title from Temp folder littering to [SOLVED] Temp folder littering
