Is the difference really significant?

#2 opened by jukofyork

https://arxiv.org/abs/2312.06281

4.2 Repeatability
The benchmark demonstrated good repeatability over multiple benchmark runs for the models tested, with an average 2.93% CV.

miiqu-f16: 83.17
miqu-1: 82.91

I'm not sure if the leaderboard still uses the same tests as the paper, but note that a 2.93% CV at these scores implies a run-to-run standard deviation of roughly 2.4 points, so a 0.26-point gap could easily be noise. It's probably worth running some tests to see if the difference is significant (e.g. a paired bootstrap over the per-question scores, or similar).
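A minimal sketch of what that could look like, assuming you can pull per-question scores for both models out of the benchmark harness (the function and variable names here are placeholders, not anything from the EQ-Bench code base):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: how often does A fail to beat B
    when the question set is resampled with replacement?"""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    assert a.shape == b.shape, "scores must be paired per question"
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample question indices
        diffs[i] = a[idx].mean() - b[idx].mean()
    observed = a.mean() - b.mean()
    p = float(np.mean(diffs <= 0.0))
    return observed, p

# obs, p = paired_bootstrap(miiqu_scores, miqu_scores)
# print(f"observed diff {obs:.2f}, bootstrap p ~ {p:.3f}")
```

The pairing is the important part: both models answer the same questions, so resampling questions jointly controls for per-question difficulty instead of treating the two score sets as independent samples.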

That result came from a run performed by the EQ-Bench maintainer, and it was lower than my own repeated runs; I might just have been unlucky on that particular run.

I used a slightly modified inference setup, which needed some changes to that code base (just the ability to pass some parameters to the inference engine, nothing weird), and I used the exllamaV2 model. I scored in a range of roughly 83.8-84.3 over multiple runs, while the base Miqu model scored between 82.7 and 83.2 for me. I'm quite confident this model is better.
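For what it's worth, even the run-level numbers can give a rough read. Here's a quick exact permutation test with made-up per-run scores that merely match the ranges quoted above (three runs per model is an assumption; substitute the real values):

```python
import itertools
import numpy as np

# Hypothetical per-run scores, consistent with the quoted ranges.
miiqu_runs = np.array([83.8, 84.0, 84.3])
miqu_runs = np.array([82.7, 83.0, 83.2])

observed = miiqu_runs.mean() - miqu_runs.mean()

# Exact one-sided permutation test: relabel the six runs in every
# possible way and count how often the mean difference is at least
# as large as the observed one.
pooled = np.concatenate([miiqu_runs, miqu_runs])
n = len(miiqu_runs)
splits = list(itertools.combinations(range(len(pooled)), n))
count = 0
for idx in splits:
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    if pooled[mask].mean() - pooled[~mask].mean() >= observed:
        count += 1
print(f"observed diff {observed:.2f}, permutation p ~ {count / len(splits):.3f}")
```

With only three runs per model, the smallest one-sided p-value an exact test can give is 1/20 = 0.05, so more repeats (or the per-question bootstrap above) would make the claim much stronger.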
