General discussion and feedback.

#1 opened by Lewdiculous

Leave your feedback, issues, or questions here.

Feedback for the model authors can be left on their original page.


Awesome job on the quants! Just as always!

I just downloaded the exl2 version. Unfortunately, my 8GB card is already pretty much at its limit. So I searched for GGUF files and quickly found yours. First impressions with the exl2 version were pretty good. I will now test the GGUF and compare the results. Thank you very much for making the effort again!!

@TheVisitorX Q4_K_M-imat will serve you well!

If you need a bit more room, perhaps for some more context, the Q4_K_S-imat should perform well.

For automatic RoPE scaling, to make sure higher contexts are stable, please use KoboldCpp 1.64 or higher if available, as recommended.

Q4_K_S:	4.57 BPW
Q4_K_M:	4.83 BPW
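
As a rough example (just a sketch; the filename and layer count are placeholders for whatever you downloaded), loading a quant in KoboldCpp 1.64+ with a larger context only needs the context size set, since the automatic RoPE scaling is derived from it:

```
# KoboldCpp 1.64+ applies automatic RoPE scaling based on --contextsize
# (the Windows koboldcpp.exe takes the same flags)
python koboldcpp.py --model Llama-3-8B-example-Q4_K_M-imat.gguf --usecublas --gpulayers 33 --contextsize 8192
```

With 8GB of VRAM, lower --gpulayers if it runs out of memory; 33 fully offloads an 8B model.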

Personally, I'm very happy with this model.

Probably not the best place to ask this, but since quant differences were brought up: is it better to use Q4_K_S with more context, or Q4_K_M with a bit less?

My dear 2060 has been doing very well with Q4_K_M at 4k context (and does seem fine at 6k as well), but I've been wondering. I'm sure I can trade off more processing/generation time for larger context, but I've been curious about this since I know quants are lossy.

@Sovy

Is speed that important?

I also have a 2060 6GB and I use Q8_0 and Q5_K_M at 8K context.
I wouldn't go lower than 4 BPW.

Not all layers are offloaded, but honestly it is not that slow. Even Q8_0 is fast enough, about reading speed.

Are you using an older CPU with really slow RAM?

@Virt-io
Nah, my CPU is fairly recent, an i5-12600K. RAM is 3600MHz DDR4. I just can't easily justify a GPU upgrade at 1080p otherwise (plus the GPU drought that happened).

I haven't tried larger quants. I got interested in (local) LLMs fairly recently and happened to start looking into it right before the Llama 3 8B finetunes started dropping, so I'm fairly green still. I basically just tried to find the optimal VRAM usage first, since I'm still tinkering, getting back into the community, and glancing over the research. I pretty much chose the Q4_K_Ms as a starting point because they seemed the most balanced and least likely to make my PC explode.

I get a little impatient with BLAS processing while testing stuff out, but as I get more familiar with the models it's going to be less of a big deal, as I see what they can do with lorebooks, etc.

@Sovy

I have an i7-10700F.

For a 12th gen Intel CPU, I think you need to restrict threads to only use P-cores.
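
Something like this, just as a sketch (the filename is a placeholder; the 12600K has 6 P-cores and 4 E-cores, so capping the thread count at 6 is the usual suggestion):

```
# Cap threads at the P-core count so the slower E-cores don't drag generation down
python koboldcpp.py --model Llama-3-8B-example-Q4_K_M-imat.gguf --usecublas --gpulayers 33 --threads 6
```

Note that --threads only limits how many threads are spawned; if the scheduler still puts them on E-cores you may need to set CPU affinity at the OS level (Task Manager on Windows, taskset on Linux).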

@Virt-io
I was going to bring up something about that, but I had decided to remove it from my response... ugh. I'll look into that. It's also good to know that I should be able to let things rip, thanks for letting me know.

Responses are great, but almost all of them are around 100 tokens even though I've set the response token limit to 250. I've been using the same presets from Virt for the last few weeks and I didn't have this issue with Poppy. Anyone know how to make responses closer to the token limit?

@Virt-io might recommend something to change in the presets, but the response length can also be related to the character card: how its original message is written and how the Example Messages are.

My responses were closer to 200 tokens when testing.

@leechao2169

Would you download the newest ones (v1.7)? I updated them without changing the version number.


Also, like @Lewdiculous stated, make your 'First message' long and include some 'Examples of dialogue'.
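
Just to illustrate the idea (the card text here is made up, only the format matters): the 'Examples of dialogue' field in SillyTavern uses <START> to separate example chats and the {{user}}/{{char}} macros, and the model tends to mirror the length of what you put there:

```
<START>
{{user}}: *He leans against the doorway.* "So, what's the plan for tonight?"
{{char}}: *She sets her book down slowly, stretching before answering.* "Dinner first, then the night market by the docks." *She grins and grabs her coat, already listing everything she wants to show him on the way there.*
```

If your first message and examples are only one short line each, the model will usually keep its replies just as short.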


Set the max tokens to 512. It does seem that it doesn't like generating long responses; try the continue button for now. I'll see if I can get it to write even longer responses.

I'm quite partial in this matter, as I prefer ~150-200 token generations for most "normal dialogue" cards, so for me it's quite good. But it's important to share feedback like this in the threads, as it helps others know whether it will be good for them as well.

Just leaving feedback. I have to say, this one has outdone itself. While I did have the occasional issue of it wanting to end the story, a quick regen fixed that. I went in well past the context limit and it still was pretty coherent, until I could wrap up an ending point to start a new chat and continue.

Undi and IkariDev cooking straight gold with this release.

Version (v2) files added! These use imatrix data generated from the FP16 and conversions made directly from the BF16.
This is more disk and compute intensive, so let's hope we get GPU inference support for BF16 models in llama.cpp.
Hopefully this avoids any losses in the model conversion, which has been a much-discussed topic around Llama-3 and GGUF lately.
If you are able to test them and notice any issues, let me know in the discussion.

New files are uploaded.

I will test it out and let you know how it goes!

I did it since there were reports in the Reddit threads of lossy conversion with how it was done before. This should be better, but the difference might be so small it doesn't change much, which is at least better than being worse.

Update! Truthfully, it seems to be a little faster (keep in mind I am well beyond context), but I am noticing I don't have to wait as long. Performance-wise it feels about the same, maybe less ending or trying to end the story and more "and what do they do next? tune in next time" stuff. All in all pretty solid, IMO. Keep in mind, I may not even be your target audience computer-wise. I run a Ryzen 5 3600X, 32GB RAM, and an RX 580 for display and games, but for this I grabbed one of those cheap Tesla P40s. I keep it cool with a tiny thing that cost me like 8 dollars.

It should be a marginal improvement but I'm glad it's working as intended. Seems like we can finally have GGUFs of satisfactory quality for Llama-3. Hopefully... We'll see with the smaller recommended quants, like the Q4s.

Respect on driving that P40!

Unfortunately I still have a lot of issues with the GGUF version (new and old). I'm using the Q6_K quantization because I wasn't really satisfied with Q4. At first all is fine, but after some time the AI begins to get more and more repetitive. I always have to edit the responses to 'fix' that. It picks up certain phrases and repeats them over and over again. Sometimes it decides to count at the beginning of each sentence, or it always uses the same emoji or phrase. It's so annoying. But as soon as I switch to the exl2 version (and with only 4 BPW) it suddenly no longer occurs. Settings are exactly the same for both. I have tried so many things: changing rep. penalty, temperature, penalty range, min-p. Nothing seems to help (only editing the responses helps for some time).

Update:
I have now adjusted the settings slightly again and edited the responses. Looks like it works a bit better with the old version, but it could just be a coincidence. I'll keep an eye on it...

Sorry if I have forgotten, but have you been using [Virt's presets]? There was effort put into avoiding repetition like that; strangely, this issue wasn't something I ever experienced in my 120-turn conversation. Not sure if it's a backend thing? I assume usage with the latest KoboldCpp; if you were using Ooba at the time, that could have been part of it. Thanks for testing.

There shouldn't be anything worse about v2; it should be the same or a bit less lossy, but it's good to keep an eye out.

@Lewdiculous Thank you for this. Earned yourself a follow. Keep up the good work.
