Max context length

#1
by jimlloyd - opened

Hello again @mlabonne . I tried out the GGUF version of this model provided by @Eric111 and for pure chatting with 8K context I like it better than NeuralBeagle-7B (my prior 7B favorite). But it seems to degrade more when I push to larger context lengths. NeuralBeagle is able to work well with context length up to 25600 by setting --rope-freq-base 32000, but NeuralOmniBeagle does not (as measured by the perplexity tool).
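
For what it's worth, 32000 is close to what the NTK-aware RoPE scaling heuristic predicts. Here's a minimal sketch, assuming Mistral-7B defaults (rope base 10000, head dim 128, 8K trained context); those defaults are my assumption, not values I checked against this model's config:

```python
# NTK-aware RoPE base scaling heuristic. Assumption: this rule of thumb is
# one way to arrive at values like --rope-freq-base 32000; it is not taken
# from the model card or this thread.
def ntk_rope_freq_base(target_ctx: int,
                       trained_ctx: int = 8192,  # assumed trained context
                       base: float = 10000.0,    # assumed default rope base
                       head_dim: int = 128) -> float:
    """Scale the RoPE frequency base so positions interpolate to target_ctx."""
    scale = target_ctx / trained_ctx
    return base * scale ** (head_dim / (head_dim - 2))

print(ntk_rope_freq_base(25600))  # ~31800, close to the 32000 used above
```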

I have done only a little experimentation with the new group attention method for extending context lengths. That method seemed to work fine with NeuralBeagle but not so well with NeuralOmniBeagle.

In our prior conversation it seemed to me that you might have been looking for ways to generate long-context training data for fine-tuning models to handle longer contexts. Have you found something that works? If so, I assume you didn't apply it to the fine-tuning used for this model?

As I mentioned in another thread where you helped me out, this issue affects all AI models I have used so far. None of them can take a large input (even near 4000 tokens), process it properly, and give output. They can partially process it and roughly summarize it, but they can't perform operations precisely on each sentence and paragraph of the input as the input length increases, even though they handle the same instructions well on short inputs.
I guess we need a model that was trained on very long inputs, but unfortunately no one has trained and released such a model yet, as far as I know.

I think we are talking about different layers of the problem. The problem I am concerned with is pretty fundamental: can the model function at all with longer contexts? This can be measured with the perplexity tool. My experience is that models with perplexity scores under 20 can remain coherent when the length of the conversation history approaches the limit of the context window. This test doesn't demonstrate that the LLM is actually capable of using every sentence in the conversation, but it does tell whether the model can remain coherent.
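
For anyone who wants to reproduce the measurement without llama.cpp, here is a rough sliding-window perplexity sketch using transformers. The model id and eval file are placeholders, and this approximates rather than reimplements the llama.cpp perplexity tool:

```python
# Rough perplexity-at-context-length check with Hugging Face transformers.
# Assumptions: model id and eval file are placeholders; eval.txt must
# tokenize to more than `ctx` tokens for the loop to run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mlabonne/NeuralOmniBeagle-7B"  # placeholder, adjust as needed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

ids = tok(open("eval.txt").read(), return_tensors="pt").input_ids.to(model.device)

ctx = 25600  # context length under test
nlls = []
for start in range(0, ids.size(1) - ctx, ctx):
    chunk = ids[:, start:start + ctx]
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # mean negative log-likelihood
    nlls.append(out.loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"perplexity @ {ctx} tokens: {ppl.item():.2f}")
```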

I believe you are concerned with a more subtle problem: even if the model is coherent for long contexts, can it actually make use of specific information in the context? I'll call this specificity (but correct me if there is an established term I am unaware of).

One way to test specificity is to choose a reference text that nearly fills the context window, embed (or "hide") short passages (a few sentences) on some topic independent of the rest of the text at random locations inside it, and then ask questions designed to gauge how accurately the model uses the embedded text.
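
Concretely, the construction could look something like this rough sketch (the helper name, file names, and the example needle/question are all made up for illustration):

```python
# Sketch of the "hide a needle" specificity test described above.
import random

def build_needle_prompt(haystack: str, needles: list[str], seed: int = 0) -> str:
    """Insert each needle at a random paragraph boundary in the haystack."""
    random.seed(seed)
    paras = haystack.split("\n\n")
    for needle in needles:
        paras.insert(random.randrange(len(paras) + 1), needle)
    return "\n\n".join(paras)

haystack = open("reference.txt").read()  # text that nearly fills the window
needles = ["The secret launch code is 7-4-1-9."]  # hypothetical embedded fact
prompt = build_needle_prompt(haystack, needles)
question = "What is the secret launch code?"
# Feed prompt + question to the model and score whether "7-4-1-9" appears.
```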

My understanding is that fine-tuning with longer texts will improve a model's ability to remain coherent, but not necessarily improve its specificity. Perhaps it is possible to create training data using the embedding/hiding trick that would allow fine-tuning to improve specificity?
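
Building on the sketch above, turning the trick into fine-tuning data might look like this (the JSONL schema is an assumption on my part; adapt it to whatever format the fine-tuning pipeline expects):

```python
# Sketch of generating long-context fine-tuning samples with the hiding
# trick. Assumes build_needle_prompt from the previous sketch is in scope;
# the prompt/completion JSONL schema is an assumption, not a known format.
import json

def make_training_samples(haystack: str, facts: list[tuple[str, str, str]]):
    """facts: (needle_sentence, question, answer) triples."""
    for i, (needle, question, answer) in enumerate(facts):
        context = build_needle_prompt(haystack, [needle], seed=i)
        yield {"prompt": f"{context}\n\nQuestion: {question}", "completion": answer}

facts = [("The secret launch code is 7-4-1-9.",
          "What is the secret launch code?", "7-4-1-9")]
with open("long_ctx_sft.jsonl", "w") as f:
    for sample in make_training_samples(open("reference.txt").read(), facts):
        f.write(json.dumps(sample) + "\n")
```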

I see what you mean, and I think both problems are related somehow. Training on larger input lengths can improve perplexity scores and coherence, and help the model produce longer output without losing quality.

Owner

@jimlloyd It's a good point, thanks for your feedback. Definitely interested in this, but quite busy with other projects at the moment so no plans to extend the context window right now.
