General discussion and feedback thread.

#1 by Lewdiculous - opened

This is a general discussion thread. Feel free to share anything or report any issues.

Test157t/Kunocchini-1.2-7b-longtext (Benchmarks are in prep now.)

Noice!

KCPP 1.59 was released, but it seems IQ3_S support wasn't merged yet. I'll still add it since it will be in the next version, but I'll also include the old Q3_K_S for now.

Are you sure this was configured correctly?
n_yarn_orig_ctx should be 8192, freq_base_train should be 10000, and RoPE's linear scaling factor should be 16.

At least, according to the original YaRN model included in this merge.

https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k/blob/main/config.json
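
For quick comparison, the relevant part of that config.json boils down to roughly this (field names as in the standard HF YaRN-Mistral config, values as discussed above; double-check against the linked file):

"rope_theta": 10000.0
"rope_scaling": { "factor": 16.0, "original_max_position_embeddings": 8192, "type": "yarn" }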

Your newer test's config doesn't look set up correctly either.

@Tibbnak No they were all configured incorrectly, thank you for pointing that out.

I haven't had the chance to test with the updated config on this one. How does it seem to be performing? @Lewdiculous

@Test157t Seemed fine, didn't really notice anything too broken/unexpected, but I only used it casually for like 40 mins at around 12K-16K context.

After 12K it felt less "natural", a bit "stiffer" and more repetitive.

Nothing a few swipes couldn't solve. But it did seem to be more heavily affected after 12K (e.g. sometimes changing the message formatting or doing actions as the wrong character).

It handled long context better than anything else I'd tested before, considering I'm using Q4 or IQ4 quants...

@Test157t - For some reason this model keeps getting good feedback from the people who used it when I recommended it. Personally I still like it. LLMs are about as clear as magic potion brewing xD

I used the Kunocchini-7b-128k-test-v2_IQ4_XS-imatrix.gguf with the current ooba on Windows (the build with StreamingLLM support, similar to KoboldCPP), but I was not able to get the context over 8192. It immediately goes off the rails and produces gibberish.

Am I doing something wrong? I was looking forward to long contexts (this IQ4_XS fits with 50K context with all layers offloaded on a 4080), but sadly I can't get it to work.

@zappa2005 Pretty sure it's because Ooba doesn't do automatic RoPE scaling.

If you're going to use a GGUF model, use KoboldCpp; that's why I recommend it. It handles RoPE scaling automatically based on your --contextsize.
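
For example, something like this (just a sketch: the GGUF filename is the one mentioned above, the context size and layer count are placeholders to adjust for your setup, and the flags are standard KoboldCpp options):

koboldcpp.exe --model Kunocchini-7b-128k-test-v2_IQ4_XS-imatrix.gguf --contextsize 16384 --gpulayers 33 --usecublas

With --contextsize set, KoboldCpp works out the RoPE scaling for you; if I remember right, --ropeconfig is only there for when you want to override it manually.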

Not to mention, Koboldcpp is much faster than Ooba for GGUF models and has features like Context Shifting that can ensure very fast processing even at big contexts.

Related:
https://github.com/LostRuins/koboldcpp/wiki#what-is-contextshift
https://www.reddit.com/r/LocalLLaMA/comments/17ni4hm/koboldcpp_v148_context_shifting_massively_reduced/

My setup recommendation is KoboldCpp + SillyTavern.

If you need any help or something isn't working as expected, let me know; I'm happy to help.

@zappa2005 The V2 GGUFs have the built-in config corrected to follow YaRN scaling at a factor of 16. The V1 ones use a base Mistral config that expects sliding window attention (something many inference engines never bothered implementing).

Thanks for the tip, I'll try working with NTK RoPE. Regarding Kobold, you're right, in general it is much faster, but with this model even GGUF on ooba with 33/33 layers offloaded runs at 35 tk/s under full context.

The prompt-processing improvement from SmartContext is now also available for ooba; it's called StreamingLLM, as I mentioned above. Works well so far.

@zappa2005 I was out of the loop on that, but I'm curious: can you link to some documentation or the PR for the StreamingLLM feature?

I'm under the impression that Smart Shifting is something different.

Found it here... https://github.com/oobabooga/text-generation-webui/pull/4761

Ah, yeah it's pretty much the same thing, that's great QoL!

> @zappa2005 The V2 GGUFs have the built-in config corrected to follow YaRN scaling at a factor of 16. The V1 ones use a base Mistral config that expects sliding window attention (something many inference engines never bothered implementing).

Thanks for the tip! Does the YaRN scaling factor of 16 directly translate to the alpha_value/NTK RoPE I can set in ooba, or is it generally not possible to map one onto the other? Just curious how I can correct this manually.

KoboldCpp detects this, in case you want to set it manually:
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.0625
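
Side note to connect the numbers (plain arithmetic, nothing KoboldCpp-specific): the freq_scale in that log is just the reciprocal of the scaling factor, so 0.0625 lines up exactly with the YaRN factor of 16.

```
# freq_scale reported by llama.cpp/KoboldCpp is the reciprocal of the rope
# scaling factor baked into the GGUF.
scaling_factor = 16.0
freq_scale = 1.0 / scaling_factor
print(freq_scale)   # 0.0625, matching the log line above
```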

@zappa2005

I believe that's an alpha of 16? (Honestly I don't know what's going on behind the scenes that well XD)

Might also be a linear scaling factor (What's an ooba anyways)

> KoboldCpp detects this, in case you want to set it manually:
> llama_new_context_with_model: freq_base = 10000.0
> llama_new_context_with_model: freq_scale = 0.0625
> I believe that's an alpha of 16? (Honestly I don't know what's going on behind the scenes that well XD)

Yeah this is weird to me too, and there is also other related information in the GGUF metadata:
'llama.rope.freq_base': '10000.000000'
'llama.rope.scaling.type': 'yarn'
'llama.rope.dimension_count': '128'
'llama.rope.scaling.factor': '16.000000'
'llama.rope.scaling.original_context_length': '8192'
'llama.rope.scaling.finetuned': 'true'

How this translates to the settings in ooba, like alpha_value or rope_freq_base (in relation to scaling_type=yarn, which I cannot modify or set), I have no idea. Increasing alpha did help, no more gibberish after 8K, but is that the right way to do it?
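
For what it's worth, here is a rough sketch of how those knobs relate. The assumptions are flagged in the comments: the NTK formula is the commonly cited approximation rather than anything pulled from ooba's code, and I'm assuming the loader exposes a linear/YaRN compression setting (often called compress_pos_emb or rope_freq_scale) alongside alpha_value.

```
# Sketch: how the GGUF rope metadata above maps onto common loader settings.

rope_freq_base = 10000.0   # 'llama.rope.freq_base'
scaling_factor = 16.0      # 'llama.rope.scaling.factor'
head_dim = 128             # 'llama.rope.dimension_count'

# YaRN/linear scaling compresses positions; freq_base stays at 10000.
freq_scale = 1.0 / scaling_factor   # 0.0625, the value KoboldCpp reports

# NTK-aware "alpha" scaling instead stretches the rope base and leaves
# freq_scale at 1.0 (commonly cited approximation, not ooba's exact code):
alpha_value = 16.0
ntk_freq_base = rope_freq_base * alpha_value ** (head_dim / (head_dim - 2))

print(freq_scale)      # 0.0625
print(ntk_freq_base)   # roughly 167000

# So alpha_value = 16 is not the same knob as scaling factor = 16: raising
# alpha changes freq_base, which is a different style of context extension.
# That may be why it reduces gibberish past 8K even though the model was
# finetuned with YaRN at factor 16 (freq_scale = 0.0625, freq_base = 10000).
```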

> Might also be a linear scaling factor (What's an ooba anyways)

ooba is the common shorthand for https://github.com/oobabooga/text-generation-webui
