Models for Human/GPT4 Eval

#65
by natolambert - opened

Please comment and react on the models you want us to add! We'll be selecting models from this, rather than automatically running them.

natolambert pinned discussion

airoboros-13b-gpt4.ggmlv3.q8_0 https://huggingface.co/TheBloke/airoboros-13b-gpt4-GGML
nous-hermes-13b.ggmlv3.q8_0 https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML
These seem to be among the highest-performing 13B models (according to certain evaluations), and it would be nice to have them on the leaderboard.

What about testing the top 10 models of the LLM benchmark?

Please evaluate https://huggingface.co/OpenAssistant/falcon-40b-sft-top1-560. It is both a lot newer and better than the pythia-based oasst-12b that was used as one of your initial models. If multiple models are possible, then https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226 would also be nice. If not, then only evaluating falcon-40b-sft-top1-560 would be enough.

LLaMA 7B open-source alternative: MPT-7B-Instruct: https://huggingface.co/mosaicml/mpt-7b-instruct

RLHF Open Assistant 30B
https://huggingface.co/Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-hf

Chimera-inst-chat 13B, claimed to reach 97% of ChatGPT quality in GPT-4 evals
https://huggingface.co/Yhyu13/chimera-inst-chat-13b-hf

Would love to see Orca once the weights are released!

Great stuff everyone, I'll launch a batch tomorrow / early next week. We'll figure out what throughput of models we can manage. We can generally run many more models with GPT-4 evals than with human evals, but without human evals it's hard to calibrate 😅

Add falcon-7b-instruct and mpt-7b-instruct, plus their chat variants, please.

This one: Monero/Manticore-13b-Chat-Pyg-Guanaco
Heard about it on Reddit a few weeks back and I agree it (subjectively) is still the best 13B model I've tried.

Where are the results?

tiiuae/falcon-40b-instruct
timdettmers/guanaco-65b-merged
HuggingFaceH4/starchat-beta
bigcode/starcoderplus
TheBloke/Wizard-Vicuna-13B-Uncensored-HF
mosaicml/mpt-7b
Salesforce/codegen-16B-nl
facebook/galactica-120b

Microsoft Orca 13B (https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/)
Google PaLM 2/Bard
Claude+

Bumping for any of the compressed-weight (quantized) models; they need more benchmarking and could be their own leaderboard breakout.

mosaicml/mpt-30b
mosaicml/mpt-30b-instruct
mosaicml/mpt-30b-chat

lmsys/vicuna-7b-v1.3
lmsys/vicuna-13b-v1.3
lmsys/vicuna-33b-v1.3

facebook/opt-iml-1.3b
facebook/opt-iml-30b
facebook/opt-iml-max-1.3b
facebook/opt-iml-max-30b

Would be good to see the difference between a verbose chatbot LLM and a succinct instruction-tuned LLM.

Hugging Face H4 org

@tallrichandsom The GPT/Human eval leaderboard was moved here

I think you should definitely add the Falcon 7B and 40B Open Assistant fine-tuned versions. Based on Elo rating and many end users' perspective, they're the best, and Falcon 40B OA even feels superior to ChatGPT in quality. I'm really talking about the actual feel of using it and the quality of its results.

Would be good to see the comparison between the newest tiiuae/falcon-40b-instruct and GPT-4.

Hugging Face H4 org

@natolambert Closing this discussion since the Human and GPT-4 evaluation leaderboard has moved.

clefourrier changed discussion status to closed
clefourrier unpinned discussion
