
Question about openchat3.5 gsm8k score on the OpenLLM Leaderboard.

#23
by balisujohn - opened

First of all, this model is amazing: it seems to speak Japanese and write rhyming poetry in English, and it gave great code and technical advice. It feels smarter than even llama 30b models I have interacted with. But it has a surprisingly low score on the OpenLLM Leaderboard despite this figure:

[Image: benchmark chart comparing OpenChat 3.5 with ChatGPT]

reporting near parity with ChatGPT. One source of the discrepancy is that gsm8k seems to be reported as 26.84 on the best run on the OpenLLM Leaderboard, whereas on your chart I think it is reported as 77.3 (or at least greater than 62.4).

What's the story here? Based on my interactions, I'm ready to believe it's better than the leaderboard score would indicate, but I'm curious why there might be a mismatch.

OpenChat org

gsm8k is usually evaluated with CoT (chain-of-thought) prompting, but the Open LLM Leaderboard does not use any CoT, see details here. When you look at the MMLU results (HF has previously corrected the MMLU evaluation), the results are better than reported and within the 70B range.
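To illustrate the difference, here is a minimal sketch of how a CoT prompt differs from a direct-answer prompt for a GSM8K-style question. The exact prompt templates used by the leaderboard and by the OpenChat authors are assumptions here; the question, the sample completion, and the `extract_final_number` helper are all hypothetical, for illustration only.

```python
import re
from typing import Optional

# Hypothetical GSM8K-style question (not from the actual test set).
question = "A farmer has 3 pens with 12 chickens each. How many chickens in total?"

# Direct (no-CoT) prompt: the model must emit the answer immediately.
direct_prompt = f"Question: {question}\nAnswer:"

# CoT prompt: the model is invited to reason step by step before answering,
# which typically yields much higher GSM8K accuracy.
cot_prompt = f"Question: {question}\nLet's think step by step."

def extract_final_number(completion: str) -> Optional[str]:
    """Take the last number in the completion as the final answer,
    a common scoring convention for GSM8K-style evaluations."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

# Illustrative completion a CoT-prompted model might produce.
cot_completion = "Each pen has 12 chickens and there are 3 pens, so 3 * 12 = 36."
print(extract_final_number(cot_completion))  # -> 36
```

The same model can therefore post very different GSM8K numbers depending on whether the harness prompts for reasoning steps or demands an immediate answer, which is consistent with the gap between the two reported scores.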

Ah, thanks for clarifying. Excited to see future openchat models!

balisujohn changed discussion status to closed
