Knut Jägersberg
KnutJaegersberg's activity
arco consistently outperforms every SOTA model below 600M parameters on average
appvoid/arco
https://huggingface.co/blog/KnutJaegersberg/first-principles-prompt-engineering
99% of the performance across various benchmarks!
mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq
Requant of the big Llama, using 20% less memory.
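Out of curiosity about what the "4bitgs64" in the name means in practice: below is a minimal sketch of an HQQ 4-bit, group-size-64 quantization done on the fly via the HqqConfig integration in recent transformers releases. The base model id and generation settings are just illustrative, and the pre-quantized repo above documents its own loading path via the hqq package, so treat this as a sketch of the technique, not the official recipe.

```python
# Minimal sketch: on-the-fly HQQ quantization (4-bit, group size 64) with transformers.
# Assumes a recent transformers release with HqqConfig and the hqq package installed.
# Model id and settings are illustrative; the pre-quantized repo ships its own loading instructions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # base model, quantized at load time

quant_config = HqqConfig(nbits=4, group_size=64)  # matches the "4bitgs64" naming

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config,
)

prompt = "Explain HQQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```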
neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
I don't agree with some of the assertions made here, but it is an interesting paper and a good overview.
https://arxiv.org/abs/2401.13142
Don't burn out! Lighten up again, will you?
There's a nice perspective outlined in here.
“When a measure becomes a target, it ceases to be a good measure.”
— Goodhart’s Law
https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
It mixed up stuff in the output and gave weird answers. I didn't have that problem with other models. Maybe the update they released solved that issue; I just never cared to check, given the alternatives.
I got some weird results. Since there are a lot of other models in that performance-parameter range, I just didn't try any further.
Want this to run on CPU
Exciting!
Thanks for sharing!
I hear there is an incredible amount of competition among LLM makers within China; I guess one would publish, and thus promote, only the best. Hundreds of models. Competition is good for performance.
I didn't dive deeply into all the creative role play models, although I sense there is a great deal of innovation happening there, unrecognized. Beautiful art!
That's a nice space you made there, but it is also unrelated to my post.
I didn't see a link to the prompt in the video, but prompt format can be optimized.
Amazing, thank you for sharing :)
code_your_own_ai makes a great vlog, mostly about LLM-related AI content.
As I watched the video below, I wondered about current best practices for LLM evaluation. We have benchmarks, we have SOTA LLMs evaluating LLMs, and we have tools that evaluate based on human comparison.
Often I hear: just play with the LLM for 15 minutes to form an opinion.
While I think that, for a specific use case with clear expectations, this could yield signal-carrying experiences, I also see single prompts being used to judge models.
While benchmarks have their weaknesses and are by themselves not enough to judge model quality, I still think systematic methods that try to reduce well-known scientific sources of error should be the way forward, even for qualitative estimates.
What do you think? How can a public tool for judging models, like lmsys/chatbot-arena-leaderboard, be made to leverage measurement standards known from social science?
https://www.youtube.com/watch?v=mWrivekFZMM
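To make the human-comparison angle concrete: Chatbot-Arena-style leaderboards aggregate pairwise "which answer is better?" votes into a ranking. Here is a minimal sketch of the classic Elo-style update that such aggregation is often explained with; as far as I know the actual lmsys leaderboard has since moved to a Bradley-Terry style fit, and the model names and votes below are made up.

```python
# Minimal sketch: Elo-style aggregation of pairwise human preference votes.
# Illustrative only; the model names and votes are hypothetical.
from collections import defaultdict

K = 32  # update step size, a common Elo default


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """outcome: 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))


ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating

votes = [  # (model_a, model_b, outcome for A)
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]
for a, b, outcome in votes:
    update(ratings, a, b, outcome)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```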
I've seen a QuIP# 2-bit Qwen-72B-Chat model on the Hub today that shows there is support for vLLM inference.
This will speed up inference and make high-performing 2-bit models more practical. I'm considering quipping MoMo now, as I can otherwise only use a brief context window of Qwen-72B on my system, even with bitsandbytes double quantization.
keyfan/Qwen-72B-Chat-2bit
Also note the easier-to-use QuIP-for-all library :)
https://github.com/chu-tianxiang/QuIP-for-all
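For reference, here is roughly what offline inference with that 2-bit checkpoint could look like in vLLM. This is a sketch under assumptions: it presumes the QuIP-for-all vLLM integration is installed and exposes a quantization method I'm calling "quip" here; check that repo for the exact flag and setup it actually uses.

```python
# Minimal sketch: vLLM offline inference with a 2-bit QuIP# checkpoint.
# Assumption: the QuIP-for-all vLLM integration is installed and registers a
# quantization method named "quip"; verify the exact name in that repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="keyfan/Qwen-72B-Chat-2bit",
    quantization="quip",       # assumed flag exposed by QuIP-for-all
    trust_remote_code=True,    # Qwen checkpoints typically need this
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize QuIP# quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```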
- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with a contrastive loss, achieving great performance (a minimal loss sketch follows this list)
- critical thought: isn't this kinda benchmark hacking? If the benchmarks are so encompassing that they capture the complete idea of embedding, it's maybe a good idea, but I often find they oversimplify.
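As mentioned in the list, here is a minimal sketch of the kind of contrastive (InfoNCE-style) objective used to fine-tune embedding models on (query, positive passage) pairs with in-batch negatives. The temperature, shapes, and random tensors are illustrative, not the paper's exact recipe.

```python
# Minimal sketch: contrastive (InfoNCE-style) loss with in-batch negatives,
# the kind of objective used to fine-tune text embedding models.
# Shapes and the temperature value are illustrative.
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, pos_emb: (batch, dim); row i of pos_emb is the positive for row i of query_emb."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)


# Toy usage with random "embeddings"
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(info_nce_loss(q, p).item())
```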
Feel free to share your thoughts, even if they, like mine, don't beat the benchmarks ;P
https://arxiv.org/abs/2401.00368
How did you do that?