
Why does this model perform so poorly on DROP compared to OpenHermes?

#29
by yahma - opened

On the Hugging Face Open LLM Leaderboard, OpenChat performs really well on all the benchmarks except for DROP, where it scores 7.22 versus the 35.79 scored by OpenHermes-2.5-Mistral.
Why such poor performance on DROP?

OpenChat org

Probably because the Open LLM Leaderboard doesn't use conversation templates or CoT; see the discussion of GSM8K here. We manually tested the DROP examples and they worked really well.
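
For reference, a minimal sketch of prompting with the conversation template applied, as opposed to feeding raw completion text the way the leaderboard harness does. This assumes the openchat/openchat_3.5 checkpoint, its bundled chat template, and an illustrative DROP-style question; it is not the leaderboard's evaluation code.

```python
# Minimal sketch: apply the model's conversation template before generating,
# assuming the openchat/openchat_3.5 checkpoint and its bundled chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openchat/openchat_3.5"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An illustrative DROP-style question, wrapped in the chat format instead of
# being passed to the model as plain completion text.
messages = [
    {
        "role": "user",
        "content": (
            "Passage: The Bears kicked field goals of 25, 38 and 52 yards. "
            "Question: How many more yards was the longest field goal than the shortest?"
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```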

imone changed discussion status to closed
