Good Papers - a steveyin Collection



Good Papers

updated 2 days ago

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Paper • 2405.20340 • Published May 30, 2024 • 20
Spectrally Pruned Gaussian Fields with Neural Compensation

Paper • 2405.00676 • Published May 1, 2024 • 10
Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Paper • 2404.18212 • Published Apr 28, 2024 • 29
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Paper • 2405.00732 • Published Apr 29, 2024 • 120
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Paper • 2405.08344 • Published May 14, 2024 • 15
LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15, 2024 • 88
Octo: An Open-Source Generalist Robot Policy

Paper • 2405.12213 • Published May 20, 2024 • 27
FIFO-Diffusion: Generating Infinite Videos from Text without Training

Paper • 2405.11473 • Published May 19, 2024 • 54
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Paper • 2406.02523 • Published Jun 4, 2024 • 12
Towards a Personal Health Large Language Model

Paper • 2406.06474 • Published Jun 10, 2024 • 23
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

Paper • 2406.06216 • Published Jun 10, 2024 • 22
Vript: A Video Is Worth Thousands of Words

Paper • 2406.06040 • Published Jun 10, 2024 • 28
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

Paper • 2406.06469 • Published Jun 10, 2024 • 27
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Paper • 2406.16860 • Published Jun 24, 2024 • 60
VideoLLM-online: Online Video Large Language Model for Streaming Video

Paper • 2406.11816 • Published Jun 17, 2024 • 24
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 48
Depth Anything V2

Paper • 2406.09414 • Published Jun 13, 2024 • 97
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13, 2024 • 51
OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13, 2024 • 37
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Paper • 2406.09403 • Published Jun 13, 2024 • 21
Transformers meet Neural Algorithmic Reasoners

Paper • 2406.09308 • Published Jun 13, 2024 • 44
MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Paper • 2406.05338 • Published Jun 8, 2024 • 41
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Paper • 2406.07476 • Published Jun 11, 2024 • 36
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Paper • 2406.04338 • Published Jun 6, 2024 • 38
The Prompt Report: A Systematic Survey of Prompting Techniques

Paper • 2406.06608 • Published Jun 6, 2024 • 59
An Image is Worth 32 Tokens for Reconstruction and Generation

Paper • 2406.07550 • Published Jun 11, 2024 • 58
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Paper • 2406.07472 • Published Jun 11, 2024 • 13
Mixture-of-Agents Enhances Large Language Model Capabilities

Paper • 2406.04692 • Published Jun 7, 2024 • 57
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Paper • 2406.01014 • Published Jun 3, 2024 • 34
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Paper • 2406.02430 • Published Jun 4, 2024 • 34
Agentless: Demystifying LLM-based Software Engineering Agents

Paper • 2407.01489 • Published Jul 1, 2024 • 61
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2, 2024 • 23
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3, 2024 • 95
TokenPacker: Efficient Visual Projector for Multimodal LLM

Paper • 2407.02392 • Published Jul 2, 2024 • 23
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Paper • 2407.02869 • Published Jul 3, 2024 • 20
Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Paper • 2407.04620 • Published Jul 5, 2024 • 31
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9, 2024 • 27
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Paper • 2407.06938 • Published Jul 9, 2024 • 23
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Paper • 2406.18009 • Published Jun 26, 2024 • 23
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

Paper • 2406.19741 • Published Jun 28, 2024 • 61
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23, 2024 • 27
Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Paper • 2406.10539 • Published Jun 15, 2024 • 1
Cross Anything: General Quadruped Robot Navigation through Complex Terrains

Paper • 2407.16412 • Published Jul 23, 2024 • 6
A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data

Paper • 2407.16680 • Published Jul 23, 2024 • 12
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Paper • 2407.14931 • Published Jul 20, 2024 • 22
EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19, 2024 • 43
The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Paper • 2407.14402 • Published Jul 19, 2024 • 14
Internal Consistency and Self-Feedback in Large Language Models: A Survey

Paper • 2407.14507 • Published Jul 19, 2024 • 46
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Paper • 2407.13833 • Published Jul 18, 2024 • 12
3D Gaussian Editing with A Single Image

Paper • 2408.07540 • Published Aug 14, 2024 • 11
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Paper • 2408.10195 • Published Aug 19, 2024 • 13
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

Paper • 2408.10161 • Published Aug 19, 2024 • 15
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Paper • 2408.07931 • Published Aug 15, 2024 • 21
Automated Design of Agentic Systems

Paper • 2408.08435 • Published Aug 15, 2024 • 39
Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22, 2024 • 126
gsplat: An Open-Source Library for Gaussian Splatting

Paper • 2409.06765 • Published Sep 10, 2024 • 16
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 57
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Paper • 2409.04196 • Published Sep 6, 2024 • 15
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Paper • 2408.16725 • Published Aug 29, 2024 • 53
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

Paper • 2408.16768 • Published Aug 29, 2024 • 27
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8, 2024 • 25
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Paper • 2409.11564 • Published Sep 17, 2024 • 20
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends

Paper • 2409.14195 • Published Sep 21, 2024 • 13
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Paper • 2409.18121 • Published Sep 26, 2024 • 9
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Paper • 2409.17280 • Published Sep 25, 2024 • 11
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

Paper • 2409.17580 • Published Sep 26, 2024 • 9
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks

Paper • 2410.01744 • Published Oct 2, 2024 • 26
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Paper • 2410.00531 • Published Oct 1, 2024 • 31
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Paper • 2410.02073 • Published Oct 2, 2024 • 41
FAN: Fourier Analysis Networks

Paper • 2410.02675 • Published Oct 3, 2024 • 26
Differential Transformer

Paper • 2410.05258 • Published Oct 7, 2024 • 171
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16, 2024 • 31
Revealing the Barriers of Language Agents in Planning

Paper • 2410.12409 • Published Oct 16, 2024 • 26
What Matters in Transformers? Not All Attention is Needed

Paper • 2406.15786 • Published Jun 22, 2024 • 31
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

Paper • 2410.09704 • Published Oct 13, 2024 • 13
Benchmarking Agentic Workflow Generation

Paper • 2410.07869 • Published Oct 10, 2024 • 26
Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10, 2024 • 24
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Paper • 2409.16299 • Published Sep 9, 2024 • 12
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes

Paper • 2410.18084 • Published Oct 23, 2024 • 14
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Paper • 2410.13924 • Published Oct 17, 2024 • 7
LLM-based Optimization of Compound AI Systems: A Survey

Paper • 2410.16392 • Published Oct 21, 2024 • 15
Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Paper • 2410.11190 • Published Oct 15, 2024 • 22
Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24, 2024 • 37
WAFFLE: Multi-Modal Model for Automated Front-End Development

Paper • 2410.18362 • Published Oct 24, 2024 • 13
A Survey of Small Language Models

Paper • 2410.20011 • Published Oct 25, 2024 • 40
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

Paper • 2410.18603 • Published Oct 24, 2024 • 32
GPT-4o System Card

Paper • 2410.21276 • Published Oct 25, 2024 • 84
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

Paper • 2410.23743 • Published Oct 31, 2024 • 62
AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21, 2024 • 59
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Paper • 2410.16271 • Published Oct 21, 2024 • 81
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

Paper • 2501.04682 • Published Jan 8 • 91
LLM4SR: A Survey on Large Language Models for Scientific Research

Paper • 2501.04306 • Published Jan 8 • 35
Agent Laboratory: Using LLM Agents as Research Assistants

Paper • 2501.04227 • Published Jan 8 • 86
Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published Jan 7 • 69
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Paper • 2501.01957 • Published Jan 3 • 42
StreamChat: Chatting with Streaming Video

Paper • 2412.08646 • Published Dec 11, 2024 • 18
DepthLab: From Partial to Complete

Paper • 2412.18153 • Published Dec 24, 2024 • 34
Learning from Massive Human Videos for Universal Humanoid Pose Control

Paper • 2412.14172 • Published Dec 18, 2024 • 10
Wonderland: Navigating 3D Scenes from a Single Image

Paper • 2412.12091 • Published Dec 16, 2024 • 16
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

Paper • 2412.00568 • Published Nov 30, 2024 • 15
GameFactory: Creating New Games with Generative Interactive Videos

Paper • 2501.08325 • Published Jan 14 • 64
NeoBERT: A Next-Generation BERT

Paper • 2502.19587 • Published 10 days ago • 38
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Paper • 2503.01743 • Published 5 days ago • 64
