Benchmarks?
Hey Sneed, thanks for testing it out. The differences were imperceptible to me at first, so it's nice to see them quantified. 15% does seem brutal, and I hope my Endurance finetune fixed that. But either way, if this opens the door to those who could never run 123B, then I think that's an overall win. I haven't run any quantitative tests on it myself, but I'm hoping someone can do us a solid and find out! (And on the Endurance tune as well.)
15% does seem brutal
Well, you've cut away 20% of the model...
15% does seem brutal
Well, you've cut away 20% of the model...
I was told these layers were useless!
@ChuckMcSneed out of curiosity, which GGUF of Mistral Large did you use for testing?
If you're willing, can you try my Q6_K quant? I would be surprised if there was a big difference from imatrix, but I'm extremely curious: https://huggingface.co/bartowski/Lazarus-2407-100B-GGUF
@ChuckMcSneed out of curiosity, which GGUF of mistral large did you use for testing?
Good question! Instead of downloading the model, I copied over the mergekit YAML and made it locally from Largestral 2407 weights on my drive. I then converted it with llama.cpp version 4261 (2759916d) with `--outtype bf16`. To keep as much of the performance as possible, I added the `--output-tensor-type BF16 --token-embedding-type BF16` arguments to llama-quantize.
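For reference, the whole pipeline looks roughly like this (paths, file names, and the final quant type below are placeholders, not the exact ones I used):

```bash
# Rebuild the merge locally from the Largestral 2407 weights on disk
mergekit-yaml lazarus-2407-100b.yaml ./Lazarus-2407-100B

# Convert the merged HF checkpoint to GGUF in bf16 (llama.cpp b4261)
python convert_hf_to_gguf.py ./Lazarus-2407-100B \
    --outtype bf16 --outfile Lazarus-2407-100B-BF16.gguf

# Quantize, keeping the output and embedding tensors in BF16
# (the Q6_K target here is just an example)
./llama-quantize --output-tensor-type BF16 --token-embedding-type BF16 \
    Lazarus-2407-100B-BF16.gguf Lazarus-2407-100B-Q6_K.gguf Q6_K
```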
If you're willing, can you try my Q6_K quant?
Why? Is there something special about them? Are you trying to give me pickled GGUF?
Is there something special about them?
Imatrix.
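In other words, the quants are made with an importance matrix computed over a calibration set, so the quantization error is weighted by how much each weight actually matters on that data. A rough sketch of the extra steps with llama.cpp (the calibration file and model names below are just placeholders):

```bash
# Compute an importance matrix from a calibration text
./llama-imatrix -m Lazarus-2407-100B-BF16.gguf \
    -f calibration.txt -o lazarus.imatrix

# Pass it to llama-quantize so the quantizer prioritizes the weights
# that were most influential on the calibration data
./llama-quantize --imatrix lazarus.imatrix \
    Lazarus-2407-100B-BF16.gguf Lazarus-2407-100B-Q6_K.gguf Q6_K
```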
The results are in! And on average they are the same.
@bartowski's quant failed the same number of tasks, but at different parts of the benchmark. So there is a difference in outputs, but it isn't significant enough to shift the average up or down.
Endurance-100B took a nosedive on UGI too.