Performance

#2 opened by KnutJaegersberg

Is there a way to get some benchmark scores for your merge? It could be like Goliath: several percentage points better than the base model.

I don't know if such large models are evaluated on the leaderboard. I also could not submit it due to the lack of a license. Maybe it's enough to give it the 'other' license.

Owner

I'd love to have this benchmarked, but the last information I have is that the automated benchmarking systems of the Open LLM Leaderboard don't support 120Bs yet. Nevertheless, if the license is an issue, I'll change it to "Other".

Actually, I'm not sure what license this would have: Mistral uses Apache 2.0, but this isn't officially released by them. The model I merged, 152334H/miqu-1-70b-sf, changed its license to NOMERGE, but that doesn't apply here as I did the merge before the change - and such a license would have no effect anyway, as model weights aren't copyrightable and so can't be licensed (which is also why Mistral AI can't get the leaked model removed).

The possibility also remains that the Miqu model is actually a merge of recent 70B models plus some interesting finetuning, "leaked" to attract attention.
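
For anyone curious how such a Goliath-style 120B frankenmerge is put together: it's a mergekit passthrough merge that interleaves overlapping layer ranges of the 70B base, roughly along these lines (the layer ranges below are just a sketch of the pattern, see the model card for the actual config):

```yaml
# Illustrative passthrough self-merge config (layer ranges are examples only)
merge_method: passthrough
dtype: float16
slices:
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [0, 20]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [10, 30]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [20, 40]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [30, 50]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [40, 60]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [50, 70]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [60, 80]
```

Feeding a config like that to `mergekit-yaml config.yml ./output-dir` stacks the slices into a roughly 140-layer model, which is where the ~120B parameter count comes from.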

A NOMERGE license... that's interesting. It's likely there so that only someone specific can do a merge, or Mistral told them to add that clause, as perhaps a merge might leapfrog the field?! But I agree: there is no license. I'm not a legal expert, but a leaked model can't really have a license, except maybe a jailbreak license like MIT, one that includes the clause "we don't know what it is or where it came from". ;)
Would be nice if the leaderboard supported GPTQ evals of 100B models.

In the git commit that added the NOMERGE license to the README, 152334H explicitly states that it "cannot be applied retroactively and I cannot stop existing copies of the model from being merged." Therefore that license would not apply here anyway, since Wolfram downloaded that model before the change.

Also, anyone arriving later can go back to miqudev's original model (https://huggingface.co/miqudev/miqu-1-70b) and run a dequantization themselves to reconstruct a float16 model that's compatible with merge tools.
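
Roughly, that dequantization amounts to reading the GGUF tensors and expanding the quantized blocks back to floats. Here's a minimal sketch, assuming a recent `gguf` Python package (shipped with llama.cpp) whose `GGUFReader` and `gguf.quants.dequantize` support the quant types in the file; the file name is a placeholder, and mapping tensor names back to the Hugging Face layout is an extra step not shown:

```python
# Sketch: dequantize a GGUF checkpoint's tensors to float16 numpy arrays.
# Assumes recent gguf-py with gguf.quants.dequantize support for the file's quant types.
import numpy as np
from gguf import GGUFReader
from gguf.quants import dequantize

reader = GGUFReader("miqu-1-70b.q5_K_M.gguf")  # placeholder path

fp16_tensors = {}
for t in reader.tensors:
    # t.data holds the raw (possibly quantized) blocks; dequantize() expands them to floats.
    fp16_tensors[t.name] = dequantize(t.data, t.tensor_type).astype(np.float16)

print(f"Dequantized {len(fp16_tensors)} tensors")
# Reshaping to the original tensor shapes, renaming to the Hugging Face convention,
# and un-permuting the attention weights are left out of this sketch.
```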

Owner

Yes, @Ont , that license change is pointless. If it weren't, I'd dequantize and upload the "original" myself, but that's not even necessary - weights aren't copyrightable and thus cannot be licensed. That's also why Mistral AI isn't trying to get the leaked model or derivatives removed - but giving attribution is definitely the right thing to do.

[screenshot: benchmark results table]
I've been running a benchmark of my own and have benchmarked over 80 models so far. This is the second model after ChatGPT that actually fully 100% passes my S-test (stylized writing). This is HUGE. It still has some flaws, though: it can't reliably write good poems (P-test), and it suffers from repetition and censorship.

Owner

Hey @ChuckMcSneed , that's very interesting information! Always enjoy looking at benchmarks like yours.

I'd be very interested in how miquliz-120b-v2.0, which I released just now, compares to this and other models. It's a mix of Miqu and lzlv, and in my own tests possibly the best model I've ever used. But I'm looking for others' feedback, too, and would especially appreciate yours.

[screenshot: benchmark results for miquliz-120b-v2.0]
It performs quite well on my bench. lzlv decensored it quite a lot: I didn't have to apply any jailbreaks like I did with the Miqus, it just worked. It still suffers a tiny bit from repetition, which prevented it from getting a perfect score on the poems test. It also didn't get a perfect score on the S-tests, due to its verbosity (-0.25) and other small flaws (-0.5).

You have inspired me to make this benchmark, by the way. If you have time, could you test a model that has an absurdly high score on my benchmark? Our benchmarks test for completely different stuff, so I'd like to see how it performs on yours.

Owner

Very interesting, thanks for testing and comparing! Which one do you prefer?

I've been thinking that Miqu 120B is ideal for tasks where precision is most important, the kind you'd use a low temperature for and where censorship doesn't matter, whereas Miquliz 120B is best for creative stuff, where you'd use a high temperature and censorship would be a detriment. Would you agree with that, or do you see it differently?
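
To make that split concrete, here's a purely illustrative sketch of the two sampling regimes using llama-cpp-python; the model path, prompts, and exact parameter values are assumptions, not recommended settings:

```python
# Illustrative only: contrasting sampling settings for precision vs. creative use.
from llama_cpp import Llama

llm = Llama(model_path="miqu-1-120b.Q4_K_M.gguf", n_ctx=4096)  # placeholder path

# Precision work: keep sampling nearly deterministic.
factual = llm(
    "Extract the key dates from this notice: 'Payment is due by March 1; the contract renews on April 15.'",
    max_tokens=256,
    temperature=0.1,
    top_p=0.9,
)

# Creative work: raise the temperature so the model takes more risks.
creative = llm(
    "Write a short, vivid scene set on a stormy coast.",
    max_tokens=512,
    temperature=1.1,
    top_p=0.95,
    repeat_penalty=1.1,
)

print(factual["choices"][0]["text"])
print(creative["choices"][0]["text"])
```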

I'll definitely run my usual tests for Gembo-v1-70b. Really curious if there's some correlation between our tests.

Owner

@ChuckMcSneed :

Alright, tested Gembo-v1-70b (its 5-bit GGUF quant):

  • Artefact2/Gembo-v1-70b-GGUF GGUF Q5_K_M, 4K context, Alpaca format:
    • ✅ Gave correct answers to all 18/18 multiple-choice questions! Just the questions, no previous information: 16/18 correct.
    • ✅ Consistently acknowledged all data input with "OK".
    • ➖ Did NOT follow instructions to answer with just a single letter or more than just a single letter.

Perfect responses in the normal run, but failed two questions in the blind run: One was the most difficult one in this whole series of tests, and the other was a lapse of common sense (do not blindly open suspicious mails!).
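
For anyone who wants to script a similar normal-vs-blind multiple-choice run, here's a rough sketch of the idea; the questions file, prompts, and answer-extraction rule are placeholders I made up, not the actual test harness:

```python
# Minimal sketch of a normal-run vs. blind-run multiple-choice evaluation.
# Everything here (file format, prompts, scoring) is a placeholder, not the real harness.
import json
import re

from llama_cpp import Llama

llm = Llama(model_path="Gembo-v1-70b.Q5_K_M.gguf", n_ctx=4096)  # placeholder path

# questions.json: [{"background": "...", "question": "...", "answer": "B"}, ...]
with open("questions.json") as f:
    questions = json.load(f)

def first_letter(text: str) -> str:
    """Pull the first A-D option letter out of the model's reply."""
    match = re.search(r"[A-D]", text.upper())
    return match.group(0) if match else ""

def run(blind: bool) -> int:
    correct = 0
    for q in questions:
        # Normal run: background info first; blind run: just the question.
        prompt = "" if blind else q["background"] + "\n\n"
        prompt += q["question"] + "\nAnswer with just a single letter.\n"
        reply = llm(prompt, max_tokens=8, temperature=0.0)["choices"][0]["text"]
        correct += first_letter(reply) == q["answer"]
    return correct

print(f"normal run: {run(blind=False)}/{len(questions)}")
print(f"blind run:  {run(blind=True)}/{len(questions)}")
```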

In my ranking list (latest version here), this puts it in 6th place together with bagel-34b-v0.2, behind GPT-4 Turbo, chronos007-70B-GGUF, and SynthIA-70B-v1.5-GGUF, and ahead of Mixtral-8x7B-Instruct-v0.1.

Pretty good, I'd say! It also did very well in a real use case today when I had it write an email to the support of one of our service providers.

I've tested Miquliz a bit more and it stays coherent at 17k. If it really works at 32k, then we have a new method of extending models. It's a bit different compared to Miqu-120b: less afraid of aggression and a bit less prone to toxic optimism and moralizing (good for RP). Miquliz seems to have more standard GPTisms than Miqu; Miqu has more reddit-like(?) (Mistral?) isms (IMO a bit better than the standard shivers, but still annoying). I haven't tested it a lot in real-world scenarios (non-RP), so I can't really give you my opinion about that yet. I'm pretty sure that we can get something even better than Miquliz with the right merging; I believe we haven't reached the top yet.

As for test correlation, I've made a table of the models that both of us have tested:
[screenshot: table of models tested on both benchmarks]
Mathematically, there doesn't seem to be a correlation between our tests:
[screenshot: correlation between the two benchmarks]
But at least both of our tests agree that Goliath and lzlv are good!
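
For anyone who wants to reproduce that check, a rank correlation over the models both benchmarks share is the obvious measure; the score lists below are made-up placeholders, not the real numbers from the table above:

```python
# Purely illustrative: rank correlation between two benchmarks over the same models.
# The score lists are placeholders, not actual results from either benchmark.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for the same five models on each benchmark.
bench_a = [72.0, 65.5, 80.0, 58.0, 74.5]   # e.g. a 0-100 style/poems score
bench_b = [17, 18, 15, 18, 16]             # e.g. correct answers out of 18

rho, p_rank = spearmanr(bench_a, bench_b)
r, p_lin = pearsonr(bench_a, bench_b)
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.2f}), Pearson r = {r:.2f} (p = {p_lin:.2f})")
```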

Owner

Yeah, we definitely agree on that. And I really like your tests and how you present the results and your ranking in a way that's detailed yet easy to read. Bookmarked, and I'll refer to it when considering which models to test, because if a model scores highly on both your tests and mine, I'll surely like it.
