Gibberish output beyond 10k tokens.
The model starts giving gibberish output when the input exceeds 8-10k tokens. It says 32k; is that the context length it actually supports?
It was ft'ed with a sliding window on textbook examples which reached a full 32k length.
Can you share the example in question and I will investigate why this occurs?
rjmehta, what backend are you using?
I was getting problems at high context as well, but I need to go back and test some more.
OK, so I tested perplexity on ptb at different context sizes:
8K: 14.1376
16K: 32.3866
16K with 2.5 RoPE alpha: 14.5083
32K: 115.6320
32k with 5.0 RoPE alpha: 22.2502
Here is Zephyr (an 8K model) for reference:
8K: 14.2910
16K: 125.4033
16K with 2.5 RoPE alpha: 13.8916 (???)
32K: 710.0536
32k with 5.0 RoPE alpha: 23.5875
So... @rjmehta by default, it seems to be slightly less catastrophic beyond 8K than other finetunes, but still seems to benefit from RoPE alpha scaling like a regular 8K model?
Maybe we have something misconfigured.
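For reference, "RoPE alpha" here is the NTK-aware scaling exposed by exllama-style backends. A minimal sketch of how alpha typically maps onto the rotary base (the formula and the Mistral-7B defaults below are my assumptions for illustration, not something measured from this model):

```python
# Minimal sketch of NTK-aware "RoPE alpha" scaling as exllama-style backends
# apply it (assumption: base' = base * alpha ** (head_dim / (head_dim - 2))).
def scaled_rope_base(base: float, alpha: float, head_dim: int) -> float:
    """Rotary embedding base after NTK-aware alpha scaling."""
    return base * alpha ** (head_dim / (head_dim - 2))

head_dim = 128      # Mistral-7B: hidden_size 4096 / 32 attention heads
base = 10000.0      # Mistral-7B v0.1 default rope_theta
for alpha in (1.0, 2.5, 5.0):
    print(f"alpha={alpha}: rope base ~= {scaled_rope_base(base, alpha, head_dim):,.0f}")
```

Larger alpha stretches the usable context at some cost to perplexity inside the original window, which is roughly the pattern in the numbers above.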
Thanks for sharing these numbers. They are very interesting.
What script did you run with btw? I will start including this as part of my training procedure.
The model was not ft'ed with RoPE scaling; rather, it uses Mistral's sliding + fixed window approach.
It has been suggested by some that I run with RoPE or yarn in my next fine-tune. I will figure out how to do this in my next iteration.
I am no expert on long context training these days, but you might take a look at this?
https://huggingface.co/Yukang/Llama-2-13b-longlora-32k
I just used exllamav2's testing script since it's very fast and I already had it set up on my desktop:
https://github.com/turboderp/exllamav2/blob/master/test_inference.py
But it requires a quantized model, so it may be less than ideal for testing during training.
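For anyone who wants to run a similar check on an unquantized checkpoint during training, here is a rough sketch with plain transformers. The model name and dataset are placeholders, and this is not the script the numbers above came from (those were ptb via exllamav2):

```python
# Rough perplexity-at-context-length check for an unquantized checkpoint.
# Assumptions: placeholder model name and wikitext-2 as a stand-in corpus
# (the numbers above were measured on ptb with exllamav2, not with this script).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"   # substitute the finetune under test
CTX = 8192                            # context length to evaluate (8192, 16384, 32768, ...)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

nlls = []
with torch.no_grad():
    for start in range(0, ids.numel() - CTX, CTX):        # non-overlapping windows
        window = ids[start : start + CTX].unsqueeze(0).to(model.device)
        nlls.append(model(window, labels=window).loss)    # shifted LM loss over the window

print(f"perplexity @ {CTX}: {torch.exp(torch.stack(nlls).mean()):.4f}")
```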
@brucethemoose I am using the exllamav2 backend. Mistral says it supports 32k, but with exllamav2 and Mistral-derivative GPTQ-quantized models it only works fine up to 8k. I will try llongorca-32k. But true 32k will only work if the input context actually reaches that limit during training. Scaling and compression techniques tend to over-compress the inputs. That works fine for tasks like "summarize", but when asked to extract specific values from a 32k context without RAG techniques, the model fails to pull out that specific information.
Please correct me if my hypothesis is wrong.
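For what it's worth, the kind of retrieval failure described above is easy to probe directly: bury one specific value deep in a long prompt and ask for it back. A rough sketch (the filler, the fact, and the question are all made up for illustration; feed `prompt` to whichever backend you are testing):

```python
# Build a long-context retrieval probe: hide one specific fact in ~30k tokens of
# filler and ask the model to extract it verbatim. Everything here is made up;
# send `prompt` to the backend under test and check the answer.
import random

FILLER = ("The quick brown fox jumps over the lazy dog. " * 20).strip()
NEEDLE = "The maintenance code for unit 7 is QX-4481."   # hypothetical fact
QUESTION = "What is the maintenance code for unit 7? Answer with the code only."

paragraphs = [FILLER] * 140                  # roughly 30k tokens of filler
insert_at = random.randrange(len(paragraphs))
paragraphs.insert(insert_at, NEEDLE)         # bury the needle at a random depth

prompt = "\n\n".join(paragraphs) + "\n\n" + QUESTION
print(f"needle at paragraph {insert_at} of {len(paragraphs)}")
# Expected answer: "QX-4481". If the model summarizes or hallucinates instead,
# retrieval at this context length is failing without RAG.
```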
Here's the same test with Amazon's MistralLite:
https://huggingface.co/emozilla/MistralLite
8K: 12.9383
16k: 13.7877
16k with 2.5 RoPE alpha: 15.8783
32k: 15.1048
32k with 2.5 RoPE alpha: 17.9006
That looks like it's scaling to 32K without any stretching needed.
Yeah you are correct, training-free scaling is a serious compromise.
I dunno what the best 32K retrieval models are atm, but I believe llongma has been succeeded by longlora 70B and MistralLite, at the very least.
Thanks for all the great analysis - it has been informative.
Have people tested Amazon's MistralLite in the wild? It seems that their approach is very high quality, so I would prefer to use this model as a base and then fine-tune it on textbook data for the next pass. I will start looking into how hard this will be.
Have people tested Amazon's MistralLite in the wild?
It's very good at summarization and retrieval. In fact, if you give it a full context, it sometimes tries to summarize (and succeeds!) or pick out a fact even if you didn't ask for that in the initial prompt.
It seems to retain general knowledge though. I think it would be a good base... or maybe even an acceptable base merged with Zephyr?
Now that you got me thinking about it, I am going to see if merging Lite into other Mistral models will "extend" their context even though the instruct syntax is different.
(I was not successful with this, as the models use different vocab)
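In case it helps anyone else trying the same thing, this is roughly what such a naive merge looks like and where the vocab mismatch bites. The model names, the 50/50 weight, and the skip-on-mismatch behaviour are all assumptions for illustration, not the exact procedure used above:

```python
# Sketch of a naive linear merge of two Mistral-architecture checkpoints.
# Assumptions: placeholder model names and a simple 50/50 interpolation.
# Vocab-sized tensors (embed_tokens, lm_head) differ in shape when one model
# adds special tokens, which is where a merge like this falls apart.
import torch
from transformers import AutoModelForCausalLM

A = "HuggingFaceH4/zephyr-7b-beta"   # placeholder base model
B = "amazon/MistralLite"             # placeholder long-context donor
WEIGHT = 0.5                         # fraction of A kept in the merge

model_a = AutoModelForCausalLM.from_pretrained(A, torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained(B, torch_dtype=torch.float16)
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

with torch.no_grad():
    for name, tensor_a in sd_a.items():
        tensor_b = sd_b.get(name)
        if tensor_b is None or tensor_b.shape != tensor_a.shape:
            print(f"skipping {name}: missing or shape mismatch")  # vocab tensors land here
            continue
        tensor_a.copy_(WEIGHT * tensor_a + (1.0 - WEIGHT) * tensor_b)

model_a.save_pretrained("merged-model")   # prompt-format differences still unresolved
```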
@brucethemoose @rjmehta - how did you run this for above 16k context? I'm running into this issue:
past key much have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`)
See this issue ("past key much have a shape of ..."): https://huggingface.co/amazon/MistralLite/discussions/14
I was running an exl2 quantization via exllamav2. I don't think it even uses the sliding window; the model "just worked" out to 32K.
Appreciate it @brucethemoose, yeah I guess if you go via exllama that sidesteps the Mistral architecture code... probably not a bad option.
I'd still like to know how to get this working. Seems odd if you have to set the sliding window size to be the same as the input context. Seems to defeat the purpose of a sliding window.
TBH the sliding window is not that great, and you are better off skipping it :P. Full attention at 32K is not very compute expensive, unless you use the unoptimized vanilla transformers code.
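In transformers terms, "skipping it" can be as simple as clearing the window in the config before loading. A sketch, assuming a reasonably recent transformers version and a Mistral-architecture checkpoint (the model name is a placeholder):

```python
# Sketch: load a Mistral-architecture model with the sliding window disabled so
# every layer attends over the full context. Assumes a recent transformers
# version; the model name is a placeholder.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

MODEL = "amazon/MistralLite"   # placeholder

config = AutoConfig.from_pretrained(MODEL)
config.sliding_window = None   # full attention instead of the 4k window

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional, needs flash-attn installed;
                                              # full attention at 32k is slow in eager mode
)
```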
Thanks, yeah agreed, I don't get the sliding window. It doesn't seem like an improvement; if anything, the opposite.