CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation Paper • 2501.09433 • Published 2 days ago • 12
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation Paper • 2501.09755 • Published 2 days ago • 23
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps Paper • 2501.09732 • Published 2 days ago • 41
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning Paper • 2501.06458 • Published 8 days ago • 29
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Paper • 2501.06282 • Published 8 days ago • 32
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding Paper • 2501.07888 • Published 5 days ago • 13
Diffusion Adversarial Post-Training for One-Step Video Generation Paper • 2501.08316 • Published 4 days ago • 29
3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering Paper • 2501.05131 • Published 9 days ago • 32
MiniMax-01: Scaling Foundation Models with Lightning Attention Paper • 2501.08313 • Published 4 days ago • 258
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published 4 days ago • 40
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration Paper • 2501.01320 • Published 16 days ago • 11
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models Paper • 2501.01423 • Published 16 days ago • 36
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 17 days ago • 95
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models Paper • 2501.00874 • Published 17 days ago • 12
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Paper • 2501.01904 • Published 15 days ago • 31
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published 15 days ago • 40