vansin

vansin's activity

upvoted a paper about 7 hours ago

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Paper • 2503.03601 • Published 11 days ago • 210

posted an update about 13 hours ago

🔥MedAgentsBench Amazing Work🚀

Just explored #MedAgentsBench from @Yale researchers and it's mind-blowing! They've created a cutting-edge benchmark that finally exposes the true capabilities of LLMs in complex medical reasoning.

⚡ Key discoveries:

DeepSeek R1 & OpenAI O3 dominate clinical reasoning tasks
Agent-based frameworks deliver exceptional performance-cost balance
Open-source alternatives are closing the gap at a fraction of the cost

This work shatters previous benchmarks that failed to challenge today's advanced models.
The future of medical AI is here: https://github.com/gersteinlab/medagents-benchmark
#MedicalAI #MachineLearning #AIinHealthcare 🔥

commented on 2 papers about 22 hours ago

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Paper • 2503.07459 • Published 6 days ago • 14

ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Paper • 2503.06885 • Published 6 days ago • 3

upvoted a paper about 23 hours ago

MinorBench: A hand-built benchmark for content-based risks for children

Paper • 2503.10242 • Published 3 days ago • 4

commented on 3 papers about 23 hours ago

MinorBench: A hand-built benchmark for content-based risks for children

Paper • 2503.10242 • Published 3 days ago • 4

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Paper • 2503.07459 • Published 6 days ago • 14

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Paper • 2503.06680 • Published 7 days ago • 17

commented on 3 papers 1 day ago

upvoted a paper 1 day ago

Charting and Navigating Hugging Face's Model Atlas

Paper • 2503.10633 • Published 3 days ago • 50

commented on 2 papers 2 days ago

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Paper • 2503.09573 • Published 4 days ago • 49

TPDiff: Temporal Pyramid Video Diffusion Model

Paper • 2503.09566 • Published 4 days ago • 39

upvoted a paper 3 days ago

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Paper • 2503.06492 • Published 7 days ago • 9

commented on 2 papers 3 days ago

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Paper • 2503.06492 • Published 7 days ago • 9

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Paper • 2503.06492 • Published 7 days ago • 9

upvoted a paper 4 days ago

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Paper • 2503.08638 • Published 5 days ago • 56

commented on 2 papers 10 days ago

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Paper • 2503.01763 • Published 13 days ago • 4

FLAME: A Federated Learning Benchmark for Robotic Manipulation

Paper • 2503.01729 • Published 13 days ago • 4