Very interesting that Miqu gives working 16k context even when only its first and last layers are used

#2 opened by akoyaki

I borrowed (copied) your mergekit_config and made a 120b model with miqu, but used only the first and last layers of miqu, with the middle filled by other 70b models. This way, it actually gives a response at 16k context, and while the response quality is poor, like TinyLlama, at least it doesn't fall apart at 16k the way the other 4k~8k models do.
I'm curious whether a 120b can reach consistent quality at 16k context by mixing in more miqu, as a baseline for what 120b models can achieve at 16k in the future.
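For anyone curious what that kind of layout looks like, here is a minimal sketch of a mergekit passthrough config (assuming PyYAML and the mergekit CLI are installed); the model names, layer ranges, and output path are placeholders, not the actual config used for this merge:

```python
# Minimal sketch of a passthrough merge where miqu supplies only the outermost
# layers and a hypothetical "other-70b" fills the middle. Layer indices are
# illustrative only; a real 120b frankenmerge interleaves many more slices.
import yaml

config = {
    "merge_method": "passthrough",
    "dtype": "float16",
    "slices": [
        {"sources": [{"model": "miqu-1-70b", "layer_range": [0, 1]}]},    # first layer from miqu
        {"sources": [{"model": "other-70b", "layer_range": [1, 79]}]},    # middle filled by another 70b
        {"sources": [{"model": "miqu-1-70b", "layer_range": [79, 80]}]},  # last layer from miqu
    ],
}

with open("mergekit_config.yml", "w") as f:
    yaml.dump(config, f, sort_keys=False)
# Then run mergekit on the generated file, e.g.: mergekit-yaml mergekit_config.yml ./merged-model
```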

@akoyaki , please give it a test and help me out. I am looking for a model that can truly handle large inputs, for example inputs with more than 2000 words. Could you verify that this model can take a long article of more than 2000 words and rewrite and expand it without losing details, giving a long output of, say, more than 2500 words?

I tried asking the AI to generate a story: 900 words used about 1300 tokens, so you would need a model with roughly 7000 tokens of context (about 3000 to hold the 2000-word input and about 4000 to generate the 2500-word output), and most models can be scaled up to 8K context (albeit with reduced intelligence).
I used my own merged model for testing, and it basically did what was asked in a smaller example (expanding an abbreviated 360-word story to about 460 words). The storyline stayed essentially the same.
But I'm not so sure about an LLM's ability to preserve detail in a 2000-word test. In my experience, the creativity needed for stories and the high confidence needed to preserve detail pull against each other in the LLM's sampler.
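As a rough sanity check on that context budget (a sketch only, using the ~1.44 tokens-per-word ratio implied by the 900-word/1300-token observation above):

```python
# Back-of-the-envelope context budget, based on the ratio observed above
# (900 words ~= 1300 tokens, i.e. roughly 1.44 tokens per word).
TOKENS_PER_WORD = 1300 / 900

input_words, output_words = 2000, 2500
input_tokens = round(input_words * TOKENS_PER_WORD)    # ~2900
output_tokens = round(output_words * TOKENS_PER_WORD)  # ~3600

total = input_tokens + output_tokens
print(f"~{input_tokens} tokens in + ~{output_tokens} tokens out = ~{total} tokens of context")
# Plus prompt/template overhead, which is why ~7000-8000 context is a safer target.
```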

original

The first of Arden's adventures took place mere days after his arrival in the strange new land. He had wandered aimlessly through the bustling streets, marveling at the towering edifices that reached for the sky like granite fingers and the continuous stream of people that flowed like an unending river through the city's veins. As he passed by a shabby-looking alleyway, a desperate scream ripped through the air.
Instinctively, Arden sprinted toward the source of distress, sword drawn, ready to face whatever horrors this modern-day Scylla would throw at him. Bursting around the corner, he found himself confronted with a far different scene than he had expected. Instead of a ferocious monster, he came upon a young woman, no older than sixteen or seventeen, cowering against the wall, clutching her purse to her chest as three burly men advanced on her. They leered crude jokes amongst themselves, their intentions clear as daylight.
"Halt!" The booming voice echoed down the narrow passageway, startling both the assailants and victim alike. Turning as one, they regarded Arden warily; a figure clad in armor and wielding a weapon in such a casual manner was not a common sight even in these times.
"Step aside, knave," Arden said coldy. His eyes, steely gray under the neon lights above, seemed to bore into each man present, daring them to defy him. "Leave the lady be, lest you wish to taste my steel."
As if spurred on by some invisible force, the ruffians broke ranks, fleeing in every direction but one: the one leading away from the enraged warrior before them. With a satisfied grin, Arden sheathed his blade, turning his attention towards the girl, who trembled against the wall. Gently, he extended a gauntleted hand, helping her back to her feet. "Are you alright?"
She nodded mutely, tears streaming down her face. He could sense her fear ebbing away, replaced by gratitude and something else - awe perhaps? Regardless, it mattered little to him. A life saved is a job well done, as the saying goes. Leading her to the nearest police station, Arden explained what happened and bid her farewell.

rewrite

As Arden wandered through the city's maze-like streets, marveling at the towering edifices that reached for the sky like granite fingers and the neverending river of people flowing through its veins, a blood-curdling scream rang out. Startled, he spun on his heels, locating the source of distress in an alleyway not far off. The sound sent shivers down his spine - it was a cry for help; someone needed him.
His armor clanking softly with each stride, Arden sprinted towards the alleyway, emerging just in time to witness three burly men accosting a young woman cowering against a wall. They leered crude jokes among themselves as they advanced upon her, their intentions clear as daylight.
"Stop!" He bellowed, startling all present, including the victim. As one unit, the assailants turned around to face the intruder, their eyes widening in disbelief and fear. A man in full battle gear, brandishing a sword? Unheard of! "Stand aside or feel my steel."
The bravest/biggest of them spoke up, "Who do you think you are?" he snarled, flexing his muscles, trying to intimidate the newcomer.
"I am Arden," he replied, sheathing his blade but keeping both hands ready just in case. "A knight errant, sworn to protect those in need, no matter where I roam."
"Yeah right," another thug chided, "You expect us to believe that rubbish?"
The third one remained silent, weighing their odds. He didn't much fancy dying today, especially not at the hand of some lunatic claiming to be a hero.
"Leave now," Arden said evenly, unperturbed by their jeering. "Or else."
The first thug sneered, "Fat chance!" With that, he lunged forward swinging wildly. Reacting faster than lightning, Arden sidestepped and grabbed hold of his arm, twisting it behind him into an agonizing lock. His comrade soon joined him on the ground, howling in pain after receiving a well-placed kick between the legs from Arden's boot.
The remaining attacker took this opportunity to reevaluate his options, then bolted, fleeing back into the crowded streets, never once looking back.
Arden helped the girl to her feet. She was trembling, tears streaming down her face. Gently, he sheathed his weapon and extended a gauntleted hand. "Are you okay?"
She nodded mutely, unable to find her voice yet. Grasping his hand tightly, she stood up.
He looked around warily, making sure there were no other threats lurking nearby. Satisfied that they were alone, he began leading her towards the nearest police station. "What's your name?"
"K… Katie." She stammered. "Thank you. I owe you my life."
"Think nothing of it." He replied modestly, "It's what I'm here for." At the police station, Arden recounted the entire episode to a skeptical officer who finally believed them when others corroborated their story. Katy gave her statement and thanked him profusely before disappearing into the crowd.

Taking it a step further, I kept miqu as the first and last layers and also used miqu for layers 1-16 and 25-40. This version can basically maintain intelligence at 12k context, understands metaphors and witticisms, and can reply correctly with something in the same vein.
But punctuation is a bit broken (it works, but it's unstable).
I'm guessing that adding another portion of miqu at layers 70-79 would help with this, and the rest can be filled with other 70b models to shape the tendencies of the target 120b model.
Maybe it's safe to say: welcome to the new era of 120b with 12k~16k context as standard?
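To make that layout easier to picture, here is a sketch of the slice plan in mergekit-style form; the layer boundaries are approximate, "filler-70b" is a placeholder, and a real 120b build would repeat and interleave ranges to reach ~140 layers rather than the plain 80 shown here:

```python
# Sketch of the described layout: miqu anchors the first and last layers and
# also covers two early/middle blocks, with other 70b layers filling the gaps.
slices = [
    {"sources": [{"model": "miqu-1-70b", "layer_range": [0, 16]}]},   # includes the first layer
    {"sources": [{"model": "filler-70b", "layer_range": [16, 25]}]},
    {"sources": [{"model": "miqu-1-70b", "layer_range": [25, 40]}]},
    {"sources": [{"model": "filler-70b", "layer_range": [40, 79]}]},
    {"sources": [{"model": "miqu-1-70b", "layer_range": [79, 80]}]},  # the last layer
]
```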

@akoyaki , thank you so much for the tests and for sharing your experience. Do you believe that using the Miqu model alone can suit my needs, or should I use a merged model? Based on my experience, I am not sure any model can be scaled up to 8K context :(

https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge - this one is confirmed to work with 16k context (over 20k it takes 30~50s to start replying).
This model (miquliz) is confirmed to work with 16k context as well.
I believe almost all Llama2 models can support 8K context with RoPE scaling, though it can impair the model's intelligence.
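For reference, this is roughly how RoPE scaling is applied to a Llama2 model in transformers (a sketch; the model id is a placeholder, and a linear factor of 2 stretches the native 4k positions to about 8k):

```python
# Sketch: extending a Llama2 model's 4k context to ~8k via linear RoPE scaling.
# "someorg/some-llama2-70b" is a placeholder model id.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "someorg/some-llama2-70b"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 2.0}  # 4096 * 2 = 8192 positions

model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")
# Inference frontends expose the same idea under names like "compress_pos_emb"
# (linear) or "alpha" (NTK-aware) scaling.
```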

I think miqu is worth a try, but it's not too obedient.
Maybe https://huggingface.co/grimulkan/Aetheria-longLORA-70b-rope8-32k-fp16 is worth a shot? It's a storytelling model, so it might do better on stories or articles.

Thank you again for sharing your knowledge. I really appreciate it. I will try Yi-34B-200K-RPMerge and see how it goes.

Owner

Yes, @akoyaki , very much appreciated from me, too.

Did you publish your merged model on HF? Feel free to link it here!

I can also confirm that having some pure Miqu layers at the beginning and the end of the merge seems to extend the coherent context range of the other model(s) used in the merge.

@wolfram nah, thanks for the kind words, but it's just a test. I didn't pay much attention to the filler part, so it's not as smart and emotional as I want (though it passes the "handcuffed test", which Goliath doesn't). I'm trying some other models and hope the result will be as great as Lemalin at 12k... Besides, I only have ~4M/s upload like you (I remember you mentioned that on Reddit?), and it's not stable, so uploads to HF often get cancelled and waste my time and my bandwidth limit (1 TB per month) lol
@sophosympatheia yeah, miqu is a very interesting sauce (or glue?) for merging multiple 70b models and extending them to 12~16k. I want to find a recipe that pins down how much miqu is needed to handle 12k and how little is needed at minimum, so I can squeeze in more of the other 70b models. Basically, I want 12~16k context but without the miqu style lol. And many thanks for your Midnight: it's a great model to merge into; using WinterGoddess and Midnight I got the most obedient and smart merged model I've ever seen.
Miqu is not so obedient: if you ask for (rather than strictly instruct) only 2 paragraphs, it usually ignores that and spits out 4~6 paragraphs. I hope to use it, but only as much as necessary, and merge in more Midnight/WinterGoddess/Gembo/lzlv to get a model that has 12k~16k context but is smarter and more emotional for RP.

I am working on a 103B Miqu-Midnight merge that I hope will capture some of what makes both Miqu and Midnight-Rose special. I'm close, but I haven't quite tuned the merge parameters to my satisfaction yet. Fingers crossed that within a week I'll have pinned it down.

The new 120b model largely maintains performance at 16k in testing (or at least any loss of performance isn't noticeable in short-term tests): no misspellings, it understands metaphors, jokes, and substitutions, very few symbols are lost, and it responds with full paragraphs rather than broken phrases.
Its replies also show some of the traits of the models used in the recipe, so I guess the next step is to tweak the model ratios in the filler part.

A reverse test, where the first and last layers do not use miqu and a portion of miqu is instead placed just inside the very beginning and very end, outputs gibberish.
It looks like miqu is required as the first and last layers.

Owner

Very interesting findings. If the first and last layers are enough to enlarge context like that, and in better ways than RoPE scaling does, that can have a big impact on how we can give older models more context without retraining.

Has anyone analyzed what exactly Mistral AI did to those layers? And whether that could be applied to models that already have big context, too, to get even more context by changing just the first and last layers?
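One way to start poking at that question without loading the full models is to diff the tensors of a single layer between miqu and a Llama2-70b base straight from the safetensors shards (a sketch; the local paths are placeholders and both checkpoints are assumed to be sharded safetensors with an index file):

```python
# Sketch: compare miqu's first-layer tensors against a Llama2-70b base to see
# how much those layers actually differ. Paths below are placeholders.
import json, os
from safetensors import safe_open

def load_layer(model_dir, prefix="model.layers.0."):
    with open(os.path.join(model_dir, "model.safetensors.index.json")) as fh:
        weight_map = json.load(fh)["weight_map"]
    tensors = {}
    for name, shard in weight_map.items():
        if name.startswith(prefix):
            with safe_open(os.path.join(model_dir, shard), framework="pt") as f:
                tensors[name] = f.get_tensor(name)
    return tensors

a = load_layer("/models/miqu-1-70b")        # placeholder path
b = load_layer("/models/llama2-70b-base")   # placeholder path
for name in sorted(set(a) & set(b)):
    diff = (a[name].float() - b[name].float()).abs().mean().item()
    print(f"{name}: mean |delta| = {diff:.6f}")
# It is also worth diffing the two config.json files (e.g. rope_theta), since
# context behavior is not necessarily all in the layer weights.
```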

I've noticed some interesting effects on alpha_rope values needed for coherence in my blends where I have linearly merged some layers of Midnight Rose with Miqu's early layers.
When I started the transition around layer 10 into miqu, alpha_rope 1 no longer worked but "normal" alpha_rope values for Llama2 worked, like alpha_rope 2 for 7168 context.
When I delayed the start of the transition to around layer 13-14 into miqu, alpha_rope 1 worked at 7168 context where it had been broken before. The same pattern held when delaying the transition until layer 20 into miqu. The strength of the transition didn't seem to matter, only when the transition began.
For all these tests, I mirrored the same transition pattern for the end layers. For example, if the first 20 layers were 100% miqu, then the last 20 layers were also 100% miqu, and so forth.
In my testing, which has mostly been at 7168 context, so not pushing it by any means, the performance of the model didn't seem to suffer when alpha_rope 1 was broken, and it didn't necessarily seem better at alpha_rope 1 in the versions where that setting was coherent.
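To make the "transition point" idea concrete, here is a small sketch (not sophosympatheia's actual recipe) of how a mirrored per-layer blend schedule over 80 layers could be generated; `transition_start` and `transition_len` are hypothetical knobs, and whether the middle ends up pure Midnight Rose or a blend is an assumption:

```python
# Sketch: a mirrored per-layer blend schedule for an 80-layer model.
# weight 1.0 = pure miqu, 0.0 = pure Midnight Rose (assumed for illustration).
def blend_schedule(n_layers=80, transition_start=14, transition_len=6):
    weights = []
    for i in range(n_layers):
        d = min(i, n_layers - 1 - i)  # distance from the nearer end -> mirrored schedule
        if d < transition_start:
            w = 1.0                   # pure miqu near both ends
        elif d < transition_start + transition_len:
            w = 1.0 - (d - transition_start + 1) / (transition_len + 1)  # linear ramp down
        else:
            w = 0.0                   # the other model in the middle
        weights.append(round(w, 3))
    return weights

print(blend_schedule())  # a list like this could feed a mergekit gradient parameter
```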

Owner

The term "Frankenstein merging" is so spot on - pretty crazy what we're doing with these models. I wish https://github.com/turboderp/exllamav2/pull/275 would gain more traction, and maybe that could even be extended to cross-model merging as well, loading layers at runtime and avoiding the creation of all these huge model files on disk. That sure would make experimenting easier and faster, and thus lead to results faster.
