code_your_own_ai makes a great vlog with mostly LLM-related AI content. As I watched the video below, I wondered about current best practices for LLM evaluation. We have benchmarks, we have SOTA LLMs evaluating LLMs, and we have tools that evaluate based on human comparison. Often I hear: just play with the LLM for 15 minutes to form an opinion. For a specific use case with clear expectations, this can yield signal-carrying experiences, but I also see a single prompt being used to judge models. Benchmarks have their weaknesses and are by themselves not enough to judge model quality, yet I still think systematic methods that try to reduce known sources of error should be the way forward, even for qualitative estimates. What do you think? How can a public tool for judging models, like lmsys/chatbot-arena-leaderboard, leverage standards established in the social sciences?
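For context on how the arena-style leaderboards aggregate pairwise human votes: Chatbot Arena derives ratings from head-to-head comparisons, in the spirit of Elo/Bradley-Terry models. Here is a minimal Elo sketch (function names and the K-factor of 32 are my illustrative choices, not the leaderboard's actual implementation):

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Update both ratings after one pairwise comparison.
    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1 - score_a) - (1 - e_a)))

# Two models start at 1000; model A wins one head-to-head vote.
ra, rb = elo_update(1000, 1000, 1.0)
print(ra, rb)  # 1016.0 984.0
```

The social-science angle would come in on top of this: controlling who votes, on which prompts, and with what instructions, the same way survey methodology controls for sampling and framing effects.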
I've seen a QuIP# 2-bit Qwen-72B-Chat model on the Hub today that shows there is support for vLLM inference. This will speed up inference and make high-performing 2-bit models more practical. I'm considering quipping MoMo now, as otherwise I can only use a short context window with Qwen-72B on my system, even with bitsandbytes double quantization.
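The back-of-the-envelope arithmetic behind why 2-bit matters at the 72B scale (weights only; this ignores KV cache, activations, and quantization overhead such as scales and codebooks, so real usage is higher):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 72e9  # rough parameter count of a 72B model
for bits in (16, 4, 2):
    print(f"{bits}-bit: {weight_memory_gb(n, bits):.0f} GB")
# 16-bit: 144 GB
# 4-bit: 36 GB
# 2-bit: 18 GB
```

Going from 4-bit to 2-bit frees on the order of 18 GB on this model, which is exactly the headroom the KV cache needs for a longer context window.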