Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders Paper • 2407.14435 • Published 3 days ago • 5
Understanding Reference Policies in Direct Preference Optimization Paper • 2407.13709 • Published 4 days ago • 11
Shape of Motion: 4D Reconstruction from a Single Video Paper • 2407.13764 • Published 4 days ago • 14
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion Paper • 2407.13759 • Published 4 days ago • 12
E5-V: Universal Embeddings with Multimodal Large Language Models Paper • 2407.12580 • Published 5 days ago • 31
Scaling Diffusion Transformers to 16 Billion Parameters Paper • 2407.11633 • Published 6 days ago • 21
GRUtopia: Dream General Robots in a City at Scale Paper • 2407.10943 • Published 7 days ago • 20
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models Paper • 2407.10285 • Published 8 days ago • 4
StyleSplat: 3D Object Style Transfer with Gaussian Splatting Paper • 2407.09473 • Published 10 days ago • 10
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On Paper • 2407.08348 • Published 11 days ago • 46
Controlling Space and Time with Diffusion Models Paper • 2407.07860 • Published 12 days ago • 15
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models Paper • 2407.06938 • Published 13 days ago • 20
VIMI: Grounding Video Generation through Multi-modal Instruction Paper • 2407.06304 • Published 14 days ago • 8
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale Paper • 2407.05282 • Published 15 days ago • 9
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published 18 days ago • 33
Investigating Decoder-only Large Language Models for Speech-to-text Translation Paper • 2407.03169 • Published 19 days ago • 9
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published 20 days ago • 47
InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation Paper • 2407.00788 • Published 22 days ago • 20
Direct Preference Knowledge Distillation for Large Language Models Paper • 2406.19774 • Published 24 days ago • 21
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? Paper • 2407.01284 • Published 21 days ago • 72
DETRs Beat YOLOs on Real-time Object Detection Paper • 2304.08069 • Published Apr 17, 2023 • 10
Understanding and Diagnosing Deep Reinforcement Learning Paper • 2406.16979 • Published 29 days ago • 8
Aligning Diffusion Models with Noise-Conditioned Perception Paper • 2406.17636 • Published 27 days ago • 26
DiffusionPDE: Generative PDE-Solving Under Partial Observation Paper • 2406.17763 • Published 27 days ago • 23
MotionBooth: Motion-Aware Customized Text-to-Video Generation Paper • 2406.17758 • Published 27 days ago • 18
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models Paper • 2406.16863 • Published 28 days ago • 10
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Paper • 2406.12624 • Published Jun 18 • 35
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models Paper • 2406.14599 • Published Jun 20 • 16
ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning Paper • 2406.14130 • Published Jun 20 • 10
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts Paper • 2406.12034 • Published Jun 17 • 12
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published Jun 17 • 36
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 14 • 54
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper • 2406.08552 • Published Jun 12 • 21
OpenVLA: An Open-Source Vision-Language-Action Model Paper • 2406.09246 • Published Jun 13 • 30
Article From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate Jun 13 • 33
An Image is Worth 32 Tokens for Reconstruction and Generation Paper • 2406.07550 • Published Jun 11 • 54
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models Paper • 2406.07472 • Published Jun 11 • 10
GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement Paper • 2406.05649 • Published Jun 9 • 7
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation Paper • 2406.06525 • Published Jun 10 • 62
Searching Priors Makes Text-to-Video Synthesis Better Paper • 2406.03215 • Published Jun 5 • 11
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation Paper • 2406.02509 • Published Jun 4 • 8
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6 • 69
VideoTetris: Towards Compositional Text-to-Video Generation Paper • 2406.04277 • Published Jun 6 • 21
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step Paper • 2406.04314 • Published Jun 6 • 26
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion Paper • 2406.03184 • Published Jun 5 • 18
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31 • 16
Learning Temporally Consistent Video Depth from Video Diffusion Priors Paper • 2406.01493 • Published Jun 3 • 17
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling Paper • 2405.21048 • Published May 31 • 11
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts Paper • 2405.19893 • Published May 30 • 26
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning Paper • 2405.18386 • Published May 28 • 18