Knut Jägersberg

KnutJaegersberg

AI & ML interests

NLP, opinion mining, narrative intelligence

KnutJaegersberg's activity

replied to BramVanroy's post about 1 month ago

It mixed up stuff in the output and gave weird answers. I didn't have that problem with other models. Maybe the update they released solved that issue; I just never cared, given the alternatives.

replied to BramVanroy's post about 1 month ago

I got some weird results. Since there are a lot of other models in that performance-parameter range, I just didn't try anymore.

replied to macadeliccc's post 3 months ago
replied to bwang0911's post 3 months ago
replied to JustinLin610's post 3 months ago
replied to osanseviero's post 3 months ago

I hear there is an incredible amount of competition among LLM makers within China; I guess one would publish, and thus promote, only the best. Hundreds of models. Competition is good for performance.

replied to s3nh's post 3 months ago

I haven't dived deeply into all the creative role-play models, although I sense there is a great deal of innovation happening there, largely unrecognized. Beautiful art!

replied to their post 3 months ago

That's a nice Space you made there, but it's unrelated to my post.

replied to their post 4 months ago

I didn't see a link to the prompt in the video, but prompt format can be optimized.

replied to their post 4 months ago
posted an update 4 months ago
Shocking: 2/3 of LLMs fail at 2K context length

code_your_own_ai makes a great vlog, mostly about LLM-related AI content.
As I watched the video below, I wondered about current best practices for LLM evaluation. We have benchmarks, we have SOTA LLMs evaluating LLMs, and we have tools that evaluate based on human comparison.
Often I hear: just play with the LLM for 15 minutes to form an opinion.
While I think that, for a specific use case with clear expectations, this could yield experiences that carry signal, I also see that a single prompt is often used to judge models.
While benchmarks have their weaknesses and are by themselves not enough to judge model quality, I still think systematic methods that try to reduce the various scientifically known sources of error should be the way forward, even for qualitative estimates.
What do you think? How can we make a public tool for judging models, like lmsys/chatbot-arena-leaderboard, leverage standards known from the social sciences?

https://www.youtube.com/watch?v=mWrivekFZMM
posted an update 4 months ago
The QuIP# ecosystem is growing :)

I saw a QuIP# 2-bit Qwen-72B-Chat model on the Hub today that shows there is support for vLLM inference.
This will speed up inference and make high-performing 2-bit models more practical. I'm considering quantizing MoMo with QuIP# now, as otherwise I can only use a short context window of Qwen-72B on my system, even with bnb double quantization (sketched at the end of this post).

keyfan/Qwen-72B-Chat-2bit

Also note the easier-to-use QuIP-for-all library :)

https://github.com/chu-tianxiang/QuIP-for-all
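
For reference, a minimal sketch of the bnb double-quantization loading mentioned above, using the standard transformers BitsAndBytesConfig API; the model id, compute dtype, and device map are illustrative assumptions, not taken from the original post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization enabled (illustrative settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumed Hub id for the chat model referenced above
model_id = "Qwen/Qwen-72B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

Even with double quantization, a 72B model at ~4 bits leaves little memory for the KV cache, which is what limits the usable context window; a 2-bit QuIP# checkpoint frees up roughly half of that weight memory again.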
posted an update 4 months ago
Microsoft: Improving Text Embeddings with Large Language Models

- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with a contrastive loss, achieving great performance (a minimal sketch of such a loss follows at the end of this post)
- critical thought: isn't this kind of benchmark hacking? If the benchmarks were encompassing enough to capture the whole idea of embedding, it might be a good approach, but I find they often oversimplify.

Feel free to share your thoughts, even if, like mine, they don't beat the benchmarks ;P


https://arxiv.org/abs/2401.00368
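
To make the contrastive fine-tuning point concrete, here is a minimal sketch of an in-batch-negative InfoNCE loss as commonly used for text embedding training; the function name and temperature are illustrative and not taken from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss for (query, positive passage) pairs.

    Row i of passage_emb is the positive for row i of query_emb;
    all other rows in the batch act as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature              # scaled cosine similarities, shape (batch, batch)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)      # match each query to its own passage
```

The LLM-generated (query, passage) pairs would simply be embedded by the model being fine-tuned and fed through a loss of this kind.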
replied to fffiloni's post 4 months ago