General discussion and feedback.

#1 opened by Lewdiculous

Leave your feedback, issues, or questions here.

Feedback for the model authors can be left on their original page.


Awesome job on the quants! Just as always!

I just downloaded the exl2 version. Unfortunately, my 8GB card is already pretty much at its limit. So I searched for GGUF files and quickly found yours. First impressions with the exl2 version were pretty good. I will now test the GGUF and compare the results. Thank you very much for making the effort again!!

@TheVisitorX Q4_K_M-imat will serve you well!

If you need a bit more room, perhaps for some more context, the Q4_K_S-imat should perform well.

For automatic RoPE scaling, to make sure higher contexts are stable, please use KoboldCpp 1.64 or higher if available, as recommended.

Q4_K_S:	4.57 BPW
Q4_K_M:	4.83 BPW
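
As a rough example (just a sketch; the filename and layer count are placeholders for whatever you downloaded), loading a quant in KoboldCpp 1.64+ with a larger context only needs the context size set, since the automatic RoPE scaling is derived from it:

```
# KoboldCpp 1.64+ applies automatic RoPE scaling based on --contextsize
# (the Windows koboldcpp.exe takes the same flags)
python koboldcpp.py --model Llama-3-8B-example-Q4_K_M-imat.gguf --usecublas --gpulayers 33 --contextsize 8192
```

With 8GB of VRAM, lower --gpulayers if it runs out of memory; 33 fully offloads an 8B model.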

Personally, I'm very happy with this model.

Probably not the best place to ask this, but since quant differences were brought up: is it better to use Q4_K_S with more context, or Q4_K_M with a bit less?

My dear 2060 has been doing very well with Q4_K_M at 4k context (and does seem fine at 6k as well), but I've been wondering. I'm sure I can trade off more processing/generation time for larger context, but I've been curious about this since I know quants are lossy.

@Sovy

Is speed that important?

I also have a 2060 6GB and I use Q8_0 and Q5_K_M at 8K context.
I wouldn't go lower than 4 BPW.

Not all layers are offloaded, but honestly it is not that slow. Even Q8_0 is fast enough, about reading speed.

Are you using an older CPU with really slow RAM?

@Virt-io
Nah, my CPU is fairly recent, an i5-12600K. RAM is 3600MHz DDR4. I just can't easily justify a GPU upgrade at 1080p otherwise (plus the GPU drought that happened).

I haven't tried larger quants. I got interested in (local) LLMs fairly recently and happened to start looking into it right before the Llama 3 8B finetunes started dropping, so I'm fairly green still. I basically just tried to find the optimal VRAM usage first, since I'm still tinkering, getting back into the community, and glancing over the research. I pretty much chose the Q4_K_Ms as a starting point because they seemed the most balanced and least likely to make my PC explode.

I get a little impatient with BLAS processing while testing stuff out, but as I get more familiar with the models it's going to be less of a big deal, as I see what they can do with lorebooks, etc.

@Sovy

I have an i7-10700F.

For a 12th gen Intel CPU, I think you need to restrict threads to only use P-cores.
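
Something like this, just as a sketch (the filename is a placeholder; the 12600K has 6 P-cores and 4 E-cores, so capping the thread count at 6 is the usual suggestion):

```
# Cap threads at the P-core count so the slower E-cores don't drag generation down
python koboldcpp.py --model Llama-3-8B-example-Q4_K_M-imat.gguf --usecublas --gpulayers 33 --threads 6
```

Note that --threads only limits how many threads are spawned; if the scheduler still puts them on E-cores you may need to set CPU affinity at the OS level (Task Manager on Windows, taskset on Linux).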

@Virt-io
I was going to bring up something about that, but I had decided to remove it from my response... ugh. I'll look into that. It's also good to know that I should be able to let things rip, thanks for letting me know.

Responses are great, but almost all of them are around 100 tokens even though I've set the response token limit to 250. I've been using the same presets from Virt for the last few weeks and I didn't have this issue with Poppy. Anyone know how to make responses closer to the token limit?

@Virt-io might recommend something to change in the presets, but the response length can also be related to the character card: how its original message is written and how the Example Messages are.

My responses were closer to 200 tokens when testing.

@leechao2169

Would you download the newest ones (v1.7)? I updated them without changing the version number.


Also, like @Lewdiculous stated, make your 'First message' long and include some 'Examples of dialogue'.
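
Just to illustrate the idea (the card text here is made up, only the format matters): the 'Examples of dialogue' field in SillyTavern uses <START> to separate example chats and the {{user}}/{{char}} macros, and the model tends to mirror the length of what you put there:

```
<START>
{{user}}: *He leans against the doorway.* "So, what's the plan for tonight?"
{{char}}: *She sets her book down slowly, stretching before answering.* "Dinner first, then the night market by the docks." *She grins and grabs her coat, already listing everything she wants to show him on the way there.*
```

If your first message and examples are only one short line each, the model will usually keep its replies just as short.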


Set the max tokens to 512. It does seem that it doesn't like generating long responses; try the continue button for now. I'll see if I can get it to write even longer responses.

I'm quite partial in this matter, as I prefer ~150-200 token generations for most "normal dialogue" cards, so for me it's quite good. But it's important to share feedback like this in the threads, as it helps others know whether it will be good for them as well.

Just leaving feedback. I have to say, this one has outdone itself. While I did have the occasional issue of it wanting to end the story, a quick regen fixed that. I went in well past the context limit and it still was pretty coherent, until I could wrap up an ending point to start a new chat and continue.

Undi and IkariDev cooking straight gold with this release.

Version (v2) files added! These use imatrix data generated from the FP16 and conversions made directly from the BF16.
This is more disk and compute intensive, so let's hope we get GPU inference support for BF16 models in llama.cpp.
Hopefully this avoids any losses in the model conversion, which has been a much-discussed topic around Llama-3 and GGUF lately.
If you are able to test them and notice any issues, let me know in the discussion.

New files are uploaded.

I will test it out and let you know how it goes!

I did it since there were reports in the Reddit threads of lossy conversion with how it was done before. This should be better, but the difference might be so small it doesn't change much, which is at least better than being worse.

Update! Truthfully, it seems to be a little faster (keep in mind I am well beyond context), but I am noticing I don't have to wait as long. Performance-wise it feels about the same, maybe less ending or trying to end the story and more "and what do they do next? tune in next time" stuff. All in all pretty solid, IMO. Keep in mind, I may not even be your target audience computer-wise. I run a Ryzen 5 3600X, 32GB RAM, and an RX 580 for display and games, but for this I grabbed one of those cheap Tesla P40s. I keep it cool with a tiny thing that cost me like 8 dollars.

It should be a marginal improvement but I'm glad it's working as intended. Seems like we can finally have GGUFs of satisfactory quality for Llama-3. Hopefully... We'll see with the smaller recommended quants, like the Q4s.

Respect on driving that P40!

Unfortunately I still have a lot of issues with the GGUF version (new and old). I'm using the Q6_K quantization because I wasn't really satisfied with Q4. At first all is fine, but after some time the AI begins to get more and more repetitive. I always have to edit the responses to 'fix' that. It picks up certain phrases and repeats them over and over again. Sometimes it decides to count at the beginning of each sentence, or it always uses the same emoji or phrase. It's so annoying. But as soon as I switch to the exl2 version (and with only 4 BPW) it suddenly no longer occurs. Settings are exactly the same for both. I have tried so many things: changing rep. penalty, temperature, penalty range, min-p. Nothing seems to help (only editing the responses helps for some time).

Update:
I have now adjusted the settings slightly again and edited the responses. Looks like it works a bit better with the old version, but it could just be a coincidence. I'll keep an eye on it...

Sorry if I have forgotten, but have you been using [Virt's presets]? There was effort put into avoiding repetition like that; strangely, this issue wasn't something I ever experienced in my 120-turn conversation. Not sure if it's a backend thing? I assume usage with the latest KoboldCpp; if you were using Ooba at the time, that could have been part of it. Thanks for testing.

There shouldn't be anything worse about v2; it should be the same or a bit less lossy, but it's good to keep an eye out.

@Lewdiculous Thank you for this. Earned yourself a follow. Keep up the good work.
