- CUGA Agent
  🤖 104 • Configurable Generalist Agent, leader in AppWorld Benchmark
- MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
  Paper • 2603.28407 • Published • 71
- How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
  Paper • 2604.04323 • Published • 41
Collections
Collections including paper arxiv:2603.28407

- ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
  Paper • 2603.25746 • Published • 155
- TAPS: Task Aware Proposal Distributions for Speculative Sampling
  Paper • 2603.27027 • Published • 144
- Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
  Paper • 2603.25716 • Published • 156
- LongCat-Next: Lexicalizing Modalities as Discrete Tokens
  Paper • 2603.27538 • Published • 147

- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
  Paper • 2505.13227 • Published • 46
- facebook/natural_reasoning
  Viewer • Updated • 1.15M • 2.11k • 571
- nvidia/OpenMathReasoning
  Viewer • Updated • 5.68M • 17k • 464
- Search Arena: Analyzing Search-Augmented LLMs
  Paper • 2506.05334 • Published • 19

- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 107
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Paper • 2310.11511 • Published • 79
- In-Context Learning Creates Task Vectors
  Paper • 2310.15916 • Published • 43
- Matryoshka Diffusion Models
  Paper • 2310.15111 • Published • 46

- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
  Paper • 2511.18538 • Published • 305
- SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
  Paper • 2511.05459 • Published • 5
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
  Paper • 2512.18470 • Published • 12
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
  Paper • 2601.09688 • Published • 127

- Qwen2.5-Omni Technical Report
  Paper • 2503.20215 • Published • 173
- Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
  Paper • 2505.22453 • Published • 46
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
  Paper • 2505.23380 • Published • 22
- More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
  Paper • 2505.21523 • Published • 13