Is the difference really significant?

#2 opened by jukofyork

https://arxiv.org/abs/2312.06281

4.2 Repeatability
The benchmark demonstrated good repeatability over multiple benchmark runs for the models tested, with an average 2.93% CV.

miiqu-f16: 83.17
miqu-1: 82.91

I'm not sure if the leaderboard still uses the same tests as the paper, but note that a 2.93% CV at these scores implies a run-to-run standard deviation of roughly 2.4 points, so a 0.26-point gap could easily be noise. It's probably worth running some tests to see if the difference is significant (e.g. a paired bootstrap over the per-question scores, or similar).
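A minimal sketch of what that could look like, assuming you can pull per-question scores for both models out of the benchmark harness (the function and variable names here are placeholders, not anything from the EQ-Bench code base):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: how often does A fail to beat B
    when the question set is resampled with replacement?"""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    assert a.shape == b.shape, "scores must be paired per question"
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample question indices
        diffs[i] = a[idx].mean() - b[idx].mean()
    observed = a.mean() - b.mean()
    p = float(np.mean(diffs <= 0.0))
    return observed, p

# obs, p = paired_bootstrap(miiqu_scores, miqu_scores)
# print(f"observed diff {obs:.2f}, bootstrap p ~ {p:.3f}")
```

The pairing is the important part: both models answer the same questions, so resampling questions jointly controls for per-question difficulty instead of treating the two score sets as independent samples.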

That result came from a run performed by the EQ-Bench maintainer, and it was lower than my own repeated runs; I might just have been unlucky on that particular run.

I used a slightly modified inference setup, which needed some changes to that code base (just the ability to pass some parameters to the inference engine, nothing weird), and I used the exllamaV2 model. I scored in a range of roughly 83.8-84.3 over multiple runs, while the base Miqu model scored between 82.7 and 83.2 for me. I'm quite confident this model is better.
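For what it's worth, even the run-level numbers can give a rough read. Here's a quick exact permutation test with made-up per-run scores that merely match the ranges quoted above (three runs per model is an assumption; substitute the real values):

```python
import itertools
import numpy as np

# Hypothetical per-run scores, consistent with the quoted ranges.
miiqu_runs = np.array([83.8, 84.0, 84.3])
miqu_runs = np.array([82.7, 83.0, 83.2])

observed = miiqu_runs.mean() - miqu_runs.mean()

# Exact one-sided permutation test: relabel the six runs in every
# possible way and count how often the mean difference is at least
# as large as the observed one.
pooled = np.concatenate([miiqu_runs, miqu_runs])
n = len(miiqu_runs)
splits = list(itertools.combinations(range(len(pooled)), n))
count = 0
for idx in splits:
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    if pooled[mask].mean() - pooled[~mask].mean() >= observed:
        count += 1
print(f"observed diff {observed:.2f}, permutation p ~ {count / len(splits):.3f}")
```

With only three runs per model, the smallest one-sided p-value an exact test can give is 1/20 = 0.05, so more repeats (or the per-question bootstrap above) would make the claim much stronger.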
