benchmarks?

#1
by distantquant - opened

It would be cool to get benchmarks for this model, at least on ARC, MT-Bench, and MMLU.

Owner

Yes, I'd love some independent benchmarks, especially MMLU. The HF leaderboard unfortunately doesn't evaluate 120B models (a real bummer, as I'd have expected Goliath 120B to top it for ages!), so I tried to run my own benchmarks with EleutherAI/lm-evaluation-harness.

Unfortunately, I can only run the quantized versions myself, so the harness's HF integration won't work for me with such a big model. And the OpenAI API integration, which lets me use lm_eval with ooba and EXL2, failed with "NotImplementedError: No support for logits." when I tried to run the MMLU tests.
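For context on that error: MMLU is a loglikelihood-style task, so the harness scores each multiple-choice question by asking the backend for the log-probability of every answer option and picking the most likely one. An API endpoint that only returns generated text (and no logits/logprobs) can't supply that, hence the "No support for logits" error. A minimal sketch of the scoring step, using made-up placeholder scores rather than real model output:

```python
# Sketch of how loglikelihood-based multiple-choice scoring works.
# The numbers below are hypothetical placeholders, not real model output.

def pick_answer(loglikelihoods: dict[str, float]) -> str:
    """Return the answer choice with the highest log-likelihood."""
    return max(loglikelihoods, key=loglikelihoods.get)

# Hypothetical per-choice log-likelihoods for one MMLU question:
scores = {"A": -4.2, "B": -1.3, "C": -5.0, "D": -3.7}
print(pick_answer(scores))  # -> B
```

This is why generation-only endpoints can run MT-Bench-style generative evals but not MMLU: the per-token logprobs never leave the server.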

Did anyone successfully run MMLU benchmarks for local models with EXL2, or at least GGUF? I'd be happy for any pointers/tutorials so I could provide those benchmarks! Or if anyone has a bigger machine and would be so kind as to run some benchmarks, let us know the results...
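For GGUF specifically, one route that may work (a sketch based on the harness docs, not something I've verified on a 120B): serve the GGUF model with llama.cpp's server, then point the harness's gguf backend at it. The port and model path below are placeholders.

```shell
# Hypothetical example: serve a GGUF model locally, then evaluate it.
# ./server is llama.cpp's HTTP server; adjust paths/ports to your setup.
./server -m /path/to/model.gguf --port 8000 &

# lm-evaluation-harness ships a "gguf" backend that talks to that server:
lm_eval --model gguf \
    --model_args base_url=http://localhost:8000 \
    --tasks mmlu \
    --num_fewshot 5
```

Whether this stays within memory/latency limits for a 120B model is another question, but it sidesteps the OpenAI-API logits limitation since the llama.cpp server can return logprobs.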
