Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published 7 days ago • 21
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution Paper • 2412.15213 • Published Dec 19, 2024 • 26
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published 17 days ago • 49
The Lessons of Developing Process Reward Models in Mathematical Reasoning Paper • 2501.07301 • Published 17 days ago • 89
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published 17 days ago • 49
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning Paper • 2408.07931 • Published Aug 15, 2024 • 20
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision Paper • 2407.06189 • Published Jul 8, 2024 • 26
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models Paper • 2309.07986 • Published Sep 14, 2023 • 3
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding Paper • 2407.01791 • Published Jul 1, 2024 • 5
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding Paper • 2407.01791 • Published Jul 1, 2024 • 5
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding Paper • 2407.01791 • Published Jul 1, 2024 • 5 • 1
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation Paper • 2404.11565 • Published Apr 17, 2024 • 15