NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper • 2504.13055 • Published 5 days ago • 18
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models Paper • 2504.11468 • Published 12 days ago • 26
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers Paper • 2504.10483 • Published 8 days ago • 20
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper • 2504.09641 • Published 9 days ago • 15
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model Paper • 2504.08685 • Published 11 days ago • 120
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Paper • 2504.08736 • Published 11 days ago • 47
Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging Paper • 2504.08635 • Published 11 days ago • 5
SigLIP Collection Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 10 items • Updated 19 days ago • 55
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion Paper • 2411.18552 • Published Nov 27, 2024 • 18
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on Paper • 2411.10499 • Published Nov 15, 2024 • 13