Gibberish output for 10k tokens.

#3
by rjmehta - opened

The model starts giving gibberish output when the input exceeds 8-10k tokens. The model card says 32k; is that the context length it actually supports?

rjmehta changed discussion title from "Exllamav2 fails for 12k tokens." to "Gibberish output for 10k tokens."

It was fine-tuned with a sliding window on textbook examples that reached the full 32k length.

Can you share the example in question? I will investigate why this occurs.

rjmehta, what backend are you using?

I was getting problems at high context as well, but I need to go back and test some more.

OK, so I tested perplexity on ptb at different context sizes:

  • 8K: 14.1376

  • 16K: 32.3866

  • 16K with 2.5 RoPE alpha: 14.5083

  • 32K: 115.6320

  • 32K with 5.0 RoPE alpha: 22.2502

Here is Zephyr (an 8K model) for reference:

  • 8K: 14.2910

  • 16K: 125.4033

  • 16K with 2.5 RoPE alpha: 13.8916 (???)

  • 32K: 710.0536

  • 32K with 5.0 RoPE alpha: 23.5875

So... @rjmehta, by default it seems to be slightly less catastrophic beyond 8K than other finetunes, but it still seems to benefit from RoPE alpha scaling like a regular 8K model?

Maybe we have something misconfigured.
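For anyone reproducing these runs: the "RoPE alpha" used above is NTK-aware scaling, which stretches the rotary base rather than the position indices. Here is a minimal sketch of the commonly used formula, assuming exllamav2-style alpha semantics and Mistral's defaults (base 10000, head dim 128); the helper name is just for illustration:

```python
# NTK-aware "alpha" scaling: enlarge the RoPE base instead of compressing positions.
# Assumes the common formula base' = base * alpha^(d / (d - 2)) for head dimension d.
def ntk_scaled_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_scaled_rope_base(2.5))  # ~25.4k, the alpha used for the 16K runs above
print(ntk_scaled_rope_base(5.0))  # ~51.3k, the alpha used for the 32K runs above
```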

Thanks for sharing these numbers. They are very interesting.

What script did you run with, btw? I will start including this as part of my training procedure.

The model was not fine-tuned with RoPE scaling; rather, it uses Mistral's sliding + fixed window approach.

It has been suggested by some that I train with RoPE scaling or YaRN in my next fine-tune. I will figure out how to do this in my next iteration.

I am no expert on long-context training these days, but you might take a look at this:

https://huggingface.co/Yukang/Llama-2-13b-longlora-32k

I just used exllamav2's testing script since it's very fast and I already had it set up on my desktop:

https://github.com/turboderp/exllamav2/blob/master/test_inference.py

But it requires a quantized model, so it may be less than ideal for testing during training.
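For testing during training without quantizing first, a plain transformers perplexity loop at a fixed context length does the same job, just slower. A rough sketch, assuming a HF-format checkpoint; the model id is a placeholder and wikitext-2 stands in for the ptb data used above:

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder: substitute the model under test
ctx_len = 8192                          # rerun with 16384 / 32768 for the comparison above

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# Concatenate the eval split into one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

total_nll, total_tokens = 0.0, 0
for start in range(0, ids.numel() - ctx_len, ctx_len):
    chunk = ids[start : start + ctx_len].unsqueeze(0).to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over ctx_len - 1 predictions
    total_nll += loss.item() * (ctx_len - 1)
    total_tokens += ctx_len - 1

print(f"perplexity @ {ctx_len}: {math.exp(total_nll / total_tokens):.4f}")
```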

@brucethemoose I am using the exllamav2 backend. Mistral says it supports 32k, but with exllamav2 and GPTQ-quantized Mistral derivatives it only works fine up to 8k. I will try the llongorca-32k. But true 32k will only work if the input context really reaches that limit during training; scaling and compression techniques tend to over-compress the inputs. That works fine for questions like "Summarize", but when asked to extract specific values from a 32k context without RAG techniques, the model fails to extract that specific information.

Please correct me if my hypothesis is wrong.

Here's the same test with Amazon's MistralLite:

https://huggingface.co/emozilla/MistralLite

  • 8K: 12.9383

  • 16K: 13.7877

  • 16K with 2.5 RoPE alpha: 15.8783

  • 32K: 15.1048

  • 32K with 2.5 RoPE alpha: 17.9006

That looks like it's scaling to 32K without any stretching needed.

@rjmehta

Yeah you are correct, training-free scaling is a serious compromise.

I dunno what the best 32K retrieval models are atm, but I believe llongma has been succeeded by longlora 70B and MistralLite, at the very least.

SciPhi-AI org

Thanks for all the great analysis - it has been informative.

Have people tested Amazon's MistralLite in the wild? It seems that their approach is very high quality, so I would prefer to use this model as a base and then fine-tune it on textbook data for the next pass. I will start looking into how hard this will be.

> Have people tested Amazon's MistralLite in the wild?

It's very good at summarization and retrieval. In fact, if you give it a full context, it sometimes tries to summarize (and succeeds!) or pick out a fact even if you didn't ask for that in the initial prompt.

It seems to retain general knowledge though. I think it would be a good base... or maybe even an acceptable base merged with Zephyr?

Now that you got me thinking about it, I am going to see if merging Lite into other Mistral models will "extend" their context even though the instruct syntax is different.

(I was not successful with this, as the models use different vocab)

@brucethemoose @rjmehta - how did you run this with more than 16k of context? I'm running into this issue:

past key much have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`)

See this issue ("past key much have a shape of"): https://huggingface.co/amazon/MistralLite/discussions/14

I was running an exl2 quantization via exllamav2. It doesn't even use the sliding window, I don't think; the model "just worked" out to 32K.

Appreciate it @brucethemoose, yeah I guess if you go via exllama that sidesteps the Mistral architecture... probably not a bad option.

I'd still like to know how to get this working. Seems odd if you have to set the sliding window size to be the same as the input context. Seems to defeat the purpose of a sliding window.

TBH the sliding window is not that great, and you are better off skipping it :P. Full attention at 32K is not very compute-expensive unless you use the unoptimized vanilla transformers code.
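If you do want to stay on transformers, here is a minimal sketch of "skipping it": widen `sliding_window` in the config before loading so every position attends to the full context. This workaround is an assumption on my part, not something verified in this thread:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"

# Assumption: setting the window at least as large as the longest context you
# plan to use makes the "sliding" window cover everything, i.e. full attention,
# and avoids the rolling-cache shape error quoted above.
cfg = AutoConfig.from_pretrained(model_id)
cfg.sliding_window = 32768

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=cfg,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```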

Thanks, yeah agreed, I don't get the sliding window. Doesn't seem an improvement, perhaps a disimprovement.
