Scores very highly on EQ Bench!

#3
by BoshiAI - opened

A few pals and I over at the Faraday Discord have been running EQBench(.com) over some popular RP LLMs, and I just wanted to let you know that Midnight Rose v2.0.3 is totally acing it!

First pass result was 78.3, putting MRv2 at #3 on our table so far.

This puts your model ahead of LZLV (#4 with 77.25) and Goliath 120B (#5 with 76.76), and up there with Miqu 70B Alpaca DPO (80.86) and Senku 70B (81.82). All except Goliath were tested on IQ3_XXS quants (Goliath was run on a Q3_K_M quant).

We'll be rerunning the tests, so it might average out up or down slightly, but I've found it doesn't usually change the score by more than a point or two at most.

Just wanted to share the joy with you as it's a great result. Worth noting, in case you feel like running tests of your own, that we had to specify the Vicuna v1.1 prompt to get the test to complete. (I initially selected Alpaca and it wasn't having any of it; the responses were legible but not always in the required format.)
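For reference, the standard Vicuna v1.1 template looks roughly like this (quoting the usual preset from memory, so double-check it against whatever your frontend ships with):

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <your prompt here> ASSISTANT:
```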
(Screenshot attached: Screenshot 2024-02-27 at 20.54.16.png)

Woah. πŸ˜… Thanks for sharing these results with me. I'm proud to see my merge performing well on an objective test. What a jump from Midnight Rose V1 to V2.

I've been working on merging Midnight Rose V2 with Miqu and I'm close to releasing something. I'll be curious to see how it stacks up.

Oh, I am REALLY looking forward to seeing the results of that merge!! I've been keeping a close eye on your page for that reason!

Maybe you could call it Miqnight Rose, or Midnight Miqu? 😁

The MRv2 model is my favourite of the models with 8K context or below, but I have been eyeing those 16K+ context Miqu and Qwen 1.5 models with envy, bouncing between your model, Senku, Miqu Alpaca DPO, Quartet and Qwen, and wanting to see a nice merge that can bring your RP pizzazz to those models.

I've heard of Wolfram's Miquliz, but I can't run a 120B on my system. I believe an MR merge would beat it, at least on the EQ test, because MR already beats LZLV :)

103B is at the upper bound of what I can run - but it'd have to be an IQ2_XXS quant and performance could take a hit. Is there any possibility of a 70B version of your new merge after a 103B or 120B version? I totally understand why you and Wolfram are merging Miqu up - but 70B is so accessible to so many people now thanks to the new quants. I'm sure a merge of Midnight Rose into Miqu at 70B would top the EQ test chart of all the current models, and more importantly I know I'd be able to run it. Plus you can be sure of another EQBench score lol.

Great minds think alike! I had already decided on calling it Midnight Miqu. πŸ˜„

The 70B version is uploading now. I tried over a dozen variations of frankenmerging Miqu with Midnight Rose, even linearly blending some layers near the borders between them, and I got some decent results. Then I took some lessons I learned from that process and did a straight up SLERP merge between them and that blew everything else out of the water. The 70B version performed strongly in my own subjective testing, and the 103B version made from extending that model might just be my new favorite.
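To give a rough idea of the shape of that merge, a mergekit SLERP between two 70B models looks something like the sketch below. The model names, layer ranges, and t values here are illustrative placeholders rather than the settings I actually used:

```yaml
# Illustrative mergekit SLERP sketch - placeholder values, not the final recipe.
merge_method: slerp
base_model: miqu-1-70b
slices:
  - sources:
      - model: miqu-1-70b
        layer_range: [0, 80]
      - model: Midnight-Rose-70B-v2.0.3
        layer_range: [0, 80]
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.5, 0.5, 1]   # interpolation curve for attention weights across layers
    - filter: mlp
      value: [1, 0.5, 0.5, 0.5, 0]   # complementary curve for MLP weights
    - value: 0.5                     # everything else blended evenly
dtype: float16
```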

My next merge is going to explore a rebuild of Midnight Rose using Nous Hermes 2 70B and Opus 70B v1.0. I think those models may enable me to get a similar result without using the weird blend of herbs and LoRAs that I used in Midnight Rose v2.0.3, which I suspect injected some creativity at the cost of performance. Once I nail that down, then I'll merge the result with Miqu and we'll see how that does.

Oh, I love you, you just made my evening! Maybe my week! lol

I figured I'd have to wait for a 103B or 120B version before hopefully getting to see a 70B model down the line. I never expected to be seeing a 'Midnight Miqu'... well, today! haha

So you found a SLERP-merged 70B model to be the natural first step to making a better 103B? Interesting! How did Wolfram go about making their 103B model? I wonder if repeating your own process might lead to an improved miquliz as well? (And from a selfish point of view, also lead to a 70B miquliz as a stepping-stone to an improved 120B model from Wolfram? lol)

This is an exciting time for me to be alive as a 32GB Mac Studio owner. First we hear the '21GB VRAM limit' is a terminal command away from being whatever we want it to be, then decent IQ3 & IQ2 quants make 70Bs accessible in under 32GB, and now this! I'm very much looking forward to trying this new model and probably becoming one of your biggest fans lol.
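(For anyone else on Apple Silicon wondering about that limit bump: it's usually done with a sysctl. On recent macOS builds the key is reported to be iogpu.wired_limit_mb, though the exact key varies by OS version and the setting resets on reboot, so treat this as a sketch rather than gospel.)

```
# Reported sysctl for raising the GPU wired-memory cap on recent macOS (resets on reboot).
# 28672 MB leaves ~4 GB for the OS on a 32 GB machine - adjust to taste.
sudo sysctl iogpu.wired_limit_mb=28672
```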

Have you had much opportunity to try stretching out the model's context buffer? I know a lot of the models and datasets weren't intended for large context sizes but they've been getting bigger and better, and I know the art of merging and mixing has improved to the point it may not matter as much. For me, a 16K version of Midnight Rose (on steroids) would be a dream come true. I couldn't run a context larger than that right now anyway.

I'll be sure to run MM through the EQBench test again once it's ready. :)

I stay in touch with @wolfram. I'll be sure to give them a heads up when the model is out so they can inspect my mergekit configuration and decide how it might be applicable to miquliz. Wolfram made the miqu-1-103b model using the same frankenmerge approach I use. See here. They even gave me credit for it, which was unnecessary but kind of them.
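For anyone unfamiliar with the term, a "frankenmerge" in this context is just a mergekit passthrough merge that stacks overlapping slices of 70B models into a taller model. A generic sketch looks something like the following; the slice boundaries are placeholders, not Wolfram's actual recipe:

```yaml
# Generic passthrough "frankenmerge" sketch - slice boundaries are illustrative only.
merge_method: passthrough
slices:
  - sources:
      - model: some-70b-model
        layer_range: [0, 40]
  - sources:
      - model: some-70b-model
        layer_range: [20, 60]
  - sources:
      - model: some-70b-model
        layer_range: [40, 80]
dtype: float16
```

Stacking 120 layers out of an 80-layer 70B like this is what puts the result at roughly 103B parameters.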

The SLERP approach I used for Midnight-Miqu-70B should work equally well with any other 70B model, and I'm sure a blend with LZLV will be strong. I trust Wolfram to impress us all.

> Have you had much opportunity to try stretching out the model's context buffer?

You're going to love this part. It turns out that any merge that preserves the beginning and end layers of miqu seems to gain miqu's 32K context coherence, at least to a degree. See this discussion. I'll admit I've been lazy and haven't tested any of these new merges past ~8K context, but I bet they can push 16K and maybe even 32K without taking a dump. I'll play around with that tonight. I just need to make a lower bpw quant of Midnight-Miqu-70B that I can test out to those extreme context lengths without hitting OOM errors.

What I know for sure right now is you can run Midnight-Miqu-70B at 8K context with alpha_rope set to 1 and it's fine, which is something new. I still think the model's performance is a little better when you set the alpha_rope to what you would normally use for a stock Llama2 model at that expanded context length (e.g. alpha_rope 2.5 for 8K context), but it's a small difference and I might just be imagining it.
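For anyone curious what alpha_rope is actually doing, the exllama-style alpha value is generally understood to apply the NTK-aware scaling rule to the RoPE base. Here's a minimal sketch, assuming the usual Llama 2 defaults (head_dim = 128, base theta = 10000) rather than anything read out of this model's config:

```python
# Minimal sketch of NTK-aware RoPE alpha scaling: base' = base * alpha^(d / (d - 2)).
# head_dim and base below are assumed Llama 2 defaults, not values from this model's config.

def scaled_rope_base(alpha: float, head_dim: int = 128, base: float = 10000.0) -> float:
    """Effective RoPE base after applying an exllama-style alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))

if __name__ == "__main__":
    for alpha in (1.0, 2.5):
        print(f"alpha_rope={alpha}: effective rope base ~= {scaled_rope_base(alpha):,.0f}")
```

Under that assumption, alpha_rope 2.5 works out to an effective base of roughly 25,000 instead of 10,000, which is the kind of stretch you'd normally give a stock Llama 2 model for 8K context.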

I'm glad Mistral has been cool about what the community has been doing with miqu. They could have brought down the hammer, but they're letting us play around with it for personal use (at least for now). Thanks, Mistral. πŸ™ We love you.

May I suggest trying IQ3_XXS on QuartetAnemoi too? I'm having a lot of fun with this one!

https://huggingface.co/Nexesenex/alchemonaut_QuartetAnemoi-70B-iMat.GGUF/tree/main

I'm using the Q4_0 myself though from this branch:

https://huggingface.co/alchemonaut/QuartetAnemoi-70B-t0.0001-GGUF/discussions/1
