Congrats!
Great result, congrats!
Although I can't help but feel you used my methods here... (lol, the joke had to be made, I'm sorry)
Thanks for sharing!
I do wonder though: it seems like yours (while performing well overall, let there be no doubt about that) sees its steepest performance increase on the GSM8K benchmark.
And as somebody rightly pointed out on my model page: the Intel neural-chat data includes GSM8K, which is also part of the leaderboard tests.
As you know, I'm really new to all of this, so I'm not quite sure how big a difference this would make and how much it would influence:
- Benchmarking results
(and more importantly:)
- How that would translate to actual model performance versus expected performance based on the benchmarking results.
Could you chime in on that?
Would it make a substantial difference, either in the results themselves or in how they relate to actual model performance in a real-world scenario?
Isn't it data from the training split of GSM8K? I don't think the neural-chat data is contaminated (but I might be wrong). If it really is test data, that would make the dataset absolutely useless :(
I don't rely completely on the Open LLM Leaderboard; I also use another benchmark suite (via https://github.com/mlabonne/llm-autoeval) for this purpose, and it doesn't include GSM8K.
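For what it's worth, a rough way to check this yourself is to look for exact (or near-exact) question overlap between the fine-tuning data and the GSM8K *test* split. Here's a minimal sketch; the variable names and the toy strings are placeholders, and in practice you'd load the real splits (e.g. with the `datasets` library) instead:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, and strip punctuation so trivial
    # formatting differences don't hide a match.
    text = re.sub(r"\s+", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def contamination_rate(train_questions, test_questions):
    # Fraction of test questions that appear verbatim (after
    # normalization) in the training data.
    train_set = {normalize(q) for q in train_questions}
    hits = sum(normalize(q) in train_set for q in test_questions)
    return hits / len(test_questions)

# Toy placeholder data, just to show the shape of the check.
sft_questions = [
    "Natalia sold clips to 48 of her friends in April...",
    "Some unrelated chat prompt.",
]
gsm8k_test_questions = [
    "Natalia sold clips to 48 of her friends in April...",
    "A question that never appears in the SFT data.",
]
print(contamination_rate(sft_questions, gsm8k_test_questions))  # 0.5
```

Exact matching only catches verbatim leakage; paraphrased contamination would need fuzzier checks (n-gram or embedding overlap), but even this quick pass usually tells you whether a dataset pulled from the test split or the training split.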