MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published 2 days ago • 11
Jina CLIP: Your CLIP Model Is Also Your Text Retriever Paper • 2405.20204 • Published 3 days ago • 17
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts Paper • 2405.19893 • Published 3 days ago • 13
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series Paper • 2405.19327 • Published 3 days ago • 34
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels Paper • 2405.16822 • Published 6 days ago • 10
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer Paper • 2405.17405 • Published 5 days ago • 12
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published 8 days ago • 41
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization Paper • 2405.15071 • Published 9 days ago • 30
Aya 23: Open Weight Releases to Further Multilingual Progress Paper • 2405.15032 • Published 9 days ago • 21
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published 9 days ago • 45
CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner Paper • 2405.14979 • Published 9 days ago • 13
iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published 9 days ago • 11
ReVideo: Remake a Video with Motion and Content Control Paper • 2405.13865 • Published 10 days ago • 19
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published 13 days ago • 33
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper • 2405.11473 • Published 14 days ago • 50
CAT3D: Create Anything in 3D with Multi-View Diffusion Models Paper • 2405.10314 • Published 16 days ago • 37
Many-Shot In-Context Learning in Multimodal Foundation Models Paper • 2405.09798 • Published 17 days ago • 25
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published 17 days ago • 95
LLM-AD: Large Language Model based Audio Description System Paper • 2405.00983 • Published May 2 • 13
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation Paper • 2405.01434 • Published about 1 month ago • 44
Paint by Inpaint: Learning to Add Image Objects by Removing Them First Paper • 2404.18212 • Published Apr 28 • 20
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25 • 49
Interactive3D: Create What You Want by Interactive 3D Generation Paper • 2404.16510 • Published Apr 25 • 18
MotionMaster: Training-free Camera Motion Transfer For Video Generation Paper • 2404.15789 • Published Apr 24 • 10
SnapKV: LLM Knows What You are Looking for Before Generation Paper • 2404.14469 • Published Apr 22 • 23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 238
TextSquare: Scaling up Text-Centric Visual Instruction Tuning Paper • 2404.12803 • Published Apr 19 • 27
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation Paper • 2404.13026 • Published Apr 19 • 21
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation Paper • 2404.11565 • Published Apr 17 • 12
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 23
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding Paper • 2404.11912 • Published Apr 18 • 16
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12 • 62
Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video Paper • 2404.09833 • Published Apr 15 • 27
Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior Paper • 2404.06780 • Published Apr 10 • 9
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion Paper • 2404.07199 • Published Apr 10 • 22
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback Paper • 2404.07987 • Published Apr 11 • 46
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences Paper • 2404.03715 • Published Apr 4 • 58
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 21
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image Paper • 2404.02152 • Published Apr 2 • 3
Freditor: High-Fidelity and Transferable NeRF Editing by Frequency Decomposition Paper • 2404.02514 • Published Apr 3 • 9
Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes Paper • 2404.01543 • Published Apr 2 • 3
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model Paper • 2404.01331 • Published Mar 29 • 22