28 2 10

M Veselovskiy

Yuuru

AI & ML interests

None yet

Recent Activity

new activity 19 days ago

mistralai/Mistral-Small-3.1-24B-Instruct-2503:HF Format?

reacted to m-ric's post with 👀 about 2 months ago

𝗔𝗱𝘆𝗲𝗻'𝘀 𝗻𝗲𝘄 𝗗𝗮𝘁𝗮 𝗔𝗴𝗲𝗻𝘁𝘀 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝘀𝗵𝗼𝘄𝘀 𝘁𝗵𝗮𝘁 𝗗𝗲𝗲𝗽𝗦𝗲𝗲𝗸-𝗥𝟭 𝘀𝘁𝗿𝘂𝗴𝗴𝗹𝗲𝘀 𝗼𝗻 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝗰𝗲 𝘁𝗮𝘀𝗸𝘀! ❌ ➡️ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system. So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand. 👎 But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers. 🧐 These results really surprised us. We thoroughly checked them, we even thought our APIs for DeepSeek were broken and colleagues Leandro Anton helped me start custom instances of R1 on our own H100s to make sure it worked well. But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data. It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to see people proposing better agents! 🚀 Read more in the blog post 👉 https://huggingface.co/blog/dabstep

new activity 6 months ago

TheDrummer/UnslopSmall-22B-v1-GGUF:Metharme format makes model extremely stupid

View all activity

Organizations

Yuuru's activity

New activity in mistralai/Mistral-Small-3.1-24B-Instruct-2503 19 days ago

HF Format?

#2 opened 19 days ago by

bartowski

reacted to m-ric's post with 👀 about 2 months ago

Post

3754

𝗔𝗱𝘆𝗲𝗻'𝘀 𝗻𝗲𝘄 𝗗𝗮𝘁𝗮 𝗔𝗴𝗲𝗻𝘁𝘀 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝘀𝗵𝗼𝘄𝘀 𝘁𝗵𝗮𝘁 𝗗𝗲𝗲𝗽𝗦𝗲𝗲𝗸-𝗥𝟭 𝘀𝘁𝗿𝘂𝗴𝗴𝗹𝗲𝘀 𝗼𝗻 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝗰𝗲 𝘁𝗮𝘀𝗸𝘀! ❌

➡️ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system.

So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.

👎 But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.

🧐 These results really surprised us. We thoroughly checked them, we even thought our APIs for DeepSeek were broken and colleagues Leandro Anton helped me start custom instances of R1 on our own H100s to make sure it worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.

It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to see people proposing better agents! 🚀

Read more in the blog post 👉 https://huggingface.co/blog/dabstep

New activity in TheDrummer/UnslopSmall-22B-v1-GGUF 6 months ago

Metharme format makes model extremely stupid

#1 opened 6 months ago by

Ainonake

upvoted a collection 7 months ago

Qwen2.5

Collection

Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated Feb 26 • 580

New activity in G-reen/gpt5o-reflexion-q-agi-llama-3.1-8b 7 months ago

How to pay

#17 opened 7 months ago by

Yuuru

New activity in mattshumer/Reflection-Llama-3.1-70B 7 months ago

DLETE THIS MODEL

#76 opened 7 months ago by

MaziyarPanahi

reacted to m-ric's post with 👍 7 months ago

Post

1913

🤯 𝗔 𝗻𝗲𝘄 𝟳𝟬𝗕 𝗼𝗽𝗲𝗻-𝘄𝗲𝗶𝗴𝗵𝘁𝘀 𝗟𝗟𝗠 𝗯𝗲𝗮𝘁𝘀 𝗖𝗹𝗮𝘂𝗱𝗲-𝟯.𝟱-𝗦𝗼𝗻𝗻𝗲𝘁 𝗮𝗻𝗱 𝗚𝗣𝗧-𝟰𝗼!

@mattshumer , CEO from Hyperwrite AI, had an idea he wanted to try out: why not fine-tune LLMs to always output their thoughts in specific parts, delineated by <thinking> tags?

Even better: inside of that, you could nest other sections, to reflect critically on previous output. Let’s name this part <reflection>. Planning is also put in a separate step.

He named the method “Reflection tuning” and set out to fine-tune a Llama-3.1-70B with it.

Well it turns out, it works mind-boggingly well!

🤯 Reflection-70B beats GPT-4o, Sonnet-3.5, and even the much bigger Llama-3.1-405B!

𝗧𝗟;𝗗𝗥
🥊 This new 70B open-weights model beats GPT-4o, Claude Sonnet, et al.
⏰ 405B in training, coming soon
📚 Report coming next week
⚙️ Uses GlaiveAI synthetic data
🤗 Available on HF!

I’m starting an Inference Endpoint right now for this model to give it a spin!

Check it out 👉 mattshumer/Reflection-Llama-3.1-70B