Models for Human/GPT4 Eval

#65
by natolambert - opened

Please comment and react on the models you want us to add! We'll be selecting models from this, rather than automatically running them.

natolambert pinned discussion

airoboros-13b-gpt4.ggmlv3.q8_0 https://huggingface.co/TheBloke/airoboros-13b-gpt4-GGML
nous-hermes-13b.ggmlv3.q8_0 https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML
These seem to be among the highest-performing 13B models (according to certain evaluations), and it would be nice to have them on the leaderboard.

What about testing the top 10 models of the LLM benchmark?

Please evaluate https://huggingface.co/OpenAssistant/falcon-40b-sft-top1-560. It is both a lot newer and better than the pythia-based oasst-12b that was used as one of your initial models. If multiple models are possible, then https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226 would also be nice. If not, then only evaluating falcon-40b-sft-top1-560 would be enough.

LLaMA 7B open-source alternative: MPT-7B-Instruct: https://huggingface.co/mosaicml/mpt-7b-instruct

RLHF Open Assistant 30B
https://huggingface.co/Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-hf

Chimera-inst-chat 13B, claimed to reach 97% of ChatGPT quality in GPT-4 evals
https://huggingface.co/Yhyu13/chimera-inst-chat-13b-hf

Would love to see Orca once the weights are released!

Great stuff everyone, I'll launch a batch tomorrow / early next week. We'll figure out what throughput of models we can manage. We can generally run many more models with GPT-4 evals than with human evals, but without human evals it's hard to calibrate 😅

Add falcon-7b-instruct and mpt-7b-instruct, plus their chat variants, please.

This one: Monero/Manticore-13b-Chat-Pyg-Guanaco
Heard about it on Reddit a few weeks back and I agree it (subjectively) is still the best 13B model I've tried.

Where are the results?

tiiuae/falcon-40b-instruct
timdettmers/guanaco-65b-merged
HuggingFaceH4/starchat-beta
bigcode/starcoderplus
TheBloke/Wizard-Vicuna-13B-Uncensored-HF
mosaicml/mpt-7b
Salesforce/codegen-16B-nl
facebook/galactica-120b

Microsoft Orca 13B (https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/)
Google PaLM 2/Bard
Claude+

Bumping for any of the compressed-weight (quantized) models; they need more benchmarking and could be their own leaderboard breakout.

mosaicml/mpt-30b
mosaicml/mpt-30b-instruct
mosaicml/mpt-30b-chat

lmsys/vicuna-7b-v1.3
lmsys/vicuna-13b-v1.3
lmsys/vicuna-33b-v1.3

facebook/opt-iml-1.3b
facebook/opt-iml-30b
facebook/opt-iml-max-1.3b
facebook/opt-iml-max-30b

Would be good to see the difference between a verbose chatbot LLM and a succinct instruction-tuned LLM.

Hugging Face H4 org

@tallrichandsom The GPT/Human eval leaderboard was moved here

I think you should definitely add the Falcon 7B and 40B Open Assistant fine-tuned versions. Based on Elo rating and many end users' perspective, they're the best, and Falcon 40B OA even feels superior to ChatGPT in quality. I'm really talking about the actual feel of using it and the quality of its results.

Would be good to see the comparison between the newest tiiuae/falcon-40b-instruct and GPT-4.

Hugging Face H4 org

@natolambert Closing this discussion since the Human and GPT-4 evaluation leaderboard has moved.

clefourrier changed discussion status to closed
clefourrier unpinned discussion
