Yi Cui

onekq

https://onekq.ai

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

updated a Space about 17 hours ago

onekq-ai/WebApp1K-models-leaderboard

onekq's activity

replied to their post about 15 hours ago

Question for the Llama team: what will be your play? 😅

posted an update about 15 hours ago

Post

286

Common formula to DIY an LLM:

Post-train a Qwen model on a dataset distilled from DeepSeek 😂

posted an update 1 day ago

Post

287

Gemma 3 is not doing well 😕
onekq-ai/WebApp1K-models-leaderboard

replied to their post 1 day ago

Not implying anything. I'd like to know what the base is. If QwQ and the DeepSeek distill use the same base, then it becomes even more puzzling why their performance differs so much.

posted an update 3 days ago

Post

1602

Qwen made good students, DeepSeek made a genius.

This is my summary of how they differ. I don't think these two players are coordinating, but they both have clear goals. One is to build an ecosystem and the other is to push toward AGI.

And IMO they are both doing really well.

2 replies

replied to their post 3 days ago

Ah I see. Thanks!

Still, the blog post doesn't mention what the base model is (if any).

replied to their post 4 days ago

Cool! I will check it out.

What I meant by switching is this. Sometimes I'm not satisfied with a ChatGPT answer and realize it needs to think harder. So I switch to o1 and ask again, and most of the time the answer gets better. Then I ask a simple follow-up question, which o1 overanalyzes, so I have to switch back to gpt-4o. I don't have the foresight to know which model fits my question best. I only know after I read the answer, which is too late.

Now imagine a conversation with a human expert. A human can do such switching remarkably well, which is what makes for a good conversation. This could actually be a metric to gauge the mileage of an applicant.

posted an update 5 days ago

Post

1375

The performance of deepseek-r1-distill-qwen-32b is abysmal. I know Qwen instruct (not coder) is quite poor at coding. As such, I have low expectations for other R1 repro works that are also based on Qwen instruct. onekq-ai/r1-reproduction-works-67a93f2fb8b21202c9eedf0b

This makes it particularly mysterious what went into QwQ-32B. Why did it work so well? Was it trained from scratch? Does anyone have insights about this?
onekq-ai/WebApp1K-models-leaderboard

5 replies

posted an update 6 days ago

Post

723

A bigger and harder pain point for reasoning models is switching modes.

We now have powerful models capable of either System 1 thinking or System 2 thinking, but not both, much less switching between the two. But humans can do this quite easily.

ChatGPT and others push the burden of switching between models onto users. I guess this is the best we have now.

2 replies

posted an update 9 days ago

Post

3240

QwQ-32B is amazing!

It ranks below o1-preview, but beats DeepSeek v3 and all Gemini models.
onekq-ai/WebApp1K-models-leaderboard

Now that we have such a powerful model that can fit on a single GPU, can someone finetune a web app model to push the SOTA on my leaderboard? 🤗

1 reply

posted an update 10 days ago

Post

544

From my own experience, these are the pain points for reasoning model adoption.

(1) They are expensive and, even worse, slow, due to excessive token output. You need to 10x your max output length to avoid clipping the thinking process.

(2) You have to filter out the thinking tokens to retrieve the final output. For mature workflows, this means broad or deep refactoring.

1p vendors (open-source and proprietary) ease these pain points by manipulating their own models. But the problems are exposed when the reasoning model is hosted by 3p MaaS providers.
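
To make pain point (2) concrete, here is a minimal sketch of the filtering step. It assumes the model wraps its reasoning in `<think>...</think>` tags, which some open reasoning models do; other models or hosting providers may use a different delimiter or a separate response field. The function name `strip_thinking` is hypothetical, not part of any library.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(raw: str) -> str:
    """Return only the final answer from a raw reasoning-model completion."""
    # Drop every complete thinking block.
    answer = THINK_BLOCK.sub("", raw)
    # An unclosed <think> tag means the output was clipped mid-reasoning
    # (pain point 1), so everything after it is discarded.
    open_idx = answer.find("<think>")
    if open_idx != -1:
        answer = answer[:open_idx]
    return answer.strip()


complete = "<think>step 1\nstep 2</think>\nThe answer is 42."
clipped = "<think>still reasoning when max tokens ran out"
print(repr(strip_thinking(complete)))
print(repr(strip_thinking(clipped)))
```

An empty result for the clipped case is a useful signal that the max output length was too small for the thinking process to finish.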

posted an update 11 days ago

Post

345

The bitter lesson (🏆Sutton🏆) should be the core value of all ML institutions and individuals.

posted an update 13 days ago

Post

2507

I was puzzled by the scope of 🐋DeepSeek🐋 projects, i.e. why they built (then open sourced) so many pieces which are all over their technology stack. Good engineers are minimalists. They build only when they have to.

Then I realized that FP8 should be the main driving force here. Your raw inter-GPU bandwidth is cut in half (H800). But if you compress your data representation from 16 bits to 8 bits, then the effective throughput of your workload stays unchanged!

The idea is simple but lots of work had to be done. Their v3 technical report will give you a holistic view (better than reading the code). To summarize, data structure is the foundation of any software. Since FP8 was new and untried, the ecosystem wasn't there. So DeepSeek became the trailblazer. Before cooking your meals, you need to till the land, grow crops, and grind the flour 😅
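
The bandwidth argument above is simple arithmetic, sketched below. The bandwidth figures and the tensor size are illustrative placeholders chosen only to show a 2x ratio; they are not quoted hardware specifications, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(num_values: int, bits_per_value: int, bandwidth_gb_per_s: float) -> float:
    """Time to move num_values over a link, ignoring latency and protocol overhead."""
    total_bytes = num_values * bits_per_value / 8
    return total_bytes / (bandwidth_gb_per_s * 1e9)


# Illustrative numbers only (assumptions, not vendor specs).
FULL_BANDWIDTH = 900.0   # GB/s, the unrestricted link
HALF_BANDWIDTH = 450.0   # GB/s, the link cut in half
NUM_VALUES = 1_000_000_000

t_16bit_full = transfer_seconds(NUM_VALUES, 16, FULL_BANDWIDTH)
t_16bit_half = transfer_seconds(NUM_VALUES, 16, HALF_BANDWIDTH)
t_8bit_half = transfer_seconds(NUM_VALUES, 8, HALF_BANDWIDTH)

print(f"16-bit on full link: {t_16bit_full:.6f} s")
print(f"16-bit on half link: {t_16bit_half:.6f} s")
print(f" 8-bit on half link: {t_8bit_half:.6f} s")
```

Halving the bits per value cancels the halved bandwidth, so the 8-bit transfer on the half-speed link takes the same time as the 16-bit transfer on the full-speed link.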

posted an update 14 days ago

Post

590

H800 is all you need.

This is my summary of 🐋DeepSeek🐋 open source week. H800 is as good as H100, except the NVLink bandwidth is cut in half.

This is a crystal clear challenge, and it rallied and motivated the innovations that followed. The rest are details.

posted an update 16 days ago

Post

513

GPT 4.5 has pulled off a pretty decent performance (on a par with Claude 3.7) but apparently there is no new SOTA. OAI already stated that GPT 4.5 is not a frontier model.
onekq-ai/WebApp1K-models-leaderboard

No SOTA for the new models from either OAI or Anthropic. This is not a coincidence. You cannot make everyone happy when more and more workflows and applications use a single model.

Vertical models will inevitably rise.

posted an update 19 days ago

Post

2760

Necessity is the mother of invention. To understand ⚡FlashMLA⚡ by 🐋DeepSeek🐋, the first question to ask is why.

The keyword here is H800, a lower-end product tailored to comply with export controls. The purpose here is to squeeze out as much performance as possible.

But here is the most important takeaway: this invention benefits EVERYONE.

2 replies

posted an update 20 days ago

Post

2165

Huge disappointment in Claude Sonnet 3.7 😞 Big performance regression. Worse than the June 2024 version. 👎
onekq-ai/WebApp1K-models-leaderboard

I'm sure this version improves on something, though, just not the thing my leaderboard measures. This proves the point that no model can be the best at everything.

2 replies

posted an update 24 days ago

Post

2050

Still waiting for 👽Grok👽 3 API ⌛😞😫

replied to their post 28 days ago

Done. So, as I understand it, you do not change the model weights, but rather tweak the inference logic? Somehow this reminds me of speculative decoding.