vansin

AI & ML interests

None yet

Recent Activity

upvoted a paper 20 days ago

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

upvoted a paper 22 days ago

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

posted an update 22 days ago

🔥MedAgentBench Amazing Work🚀

Just explored #MedAgentBench from @Yale researchers and it's mind-blowing! They've created a cutting-edge benchmark that finally exposes the true capabilities of LLMs in complex medical reasoning.

⚡ Key discoveries:

DeepSeek R1 & OpenAI O3 dominate clinical reasoning tasks

Agent-based frameworks deliver exceptional performance-cost balance

Open-source alternatives are closing the gap at a fraction of the cost

This work shatters previous benchmarks that failed to challenge today's advanced models.

The future of medical AI is here: https://github.com/gersteinlab/medagents-benchmark

#MedicalAI #MachineLearning #AIinHealthcare 🔥

Organizations

vansin's activity

commented 5 papers 22 days ago

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Paper • 2503.07459 • Published 27 days ago • 15

ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Paper • 2503.06885 • Published 28 days ago • 3

MinorBench: A hand-built benchmark for content-based risks for children

Paper • 2503.10242 • Published 24 days ago • 4

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Paper • 2503.06680 • Published 28 days ago • 18

commented 5 papers 23 days ago

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Paper • 2503.09573 • Published 25 days ago • 68

TPDiff: Temporal Pyramid Video Diffusion Model

Paper • 2503.09566 • Published 25 days ago • 44

commented 2 papers 25 days ago

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Paper • 2503.06492 • Published 28 days ago • 10

commented 6 papers about 1 month ago

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Paper • 2503.01763 • Published Mar 3 • 4

FLAME: A Federated Learning Benchmark for Robotic Manipulation

Paper • 2503.01729 • Published Mar 3 • 4

Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

Paper • 2503.01449 • Published Mar 3 • 4

CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs

Paper • 2503.01378 • Published Mar 3 • 3

SwiLTra-Bench: The Swiss Legal Translation Benchmark

Paper • 2503.01372 • Published Mar 3 • 3

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Paper • 2502.14866 • Published Feb 20 • 13

New activity in cais/hle about 1 month ago

Application of InternLM3 to Join the leaderboard

#5 opened about 1 month ago by vansin

New activity in vectara/leaderboard about 1 month ago

Application of InternLM3 to Join the leaderboard

#9 opened about 1 month ago by vansin