-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 30 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 27 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 101
Collections
Discover the best community collections!
Collections including paper arxiv:2504.05298
-
RuCCoD: Towards Automated ICD Coding in Russian
Paper • 2502.21263 • Published • 131 -
Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 117 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 44 -
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Paper • 2503.05592 • Published • 25
-
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Paper • 2410.10306 • Published • 57 -
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Paper • 2411.05003 • Published • 72 -
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Paper • 2411.04709 • Published • 27 -
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
Paper • 2410.07171 • Published • 44
-
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Paper • 2402.14083 • Published • 49 -
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 614 -
Genie: Generative Interactive Environments
Paper • 2402.15391 • Published • 72 -
Humanoid Locomotion as Next Token Prediction
Paper • 2402.19469 • Published • 29
-
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
Paper • 2401.09985 • Published • 17 -
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Paper • 2401.09962 • Published • 9 -
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Paper • 2401.10404 • Published • 10 -
ActAnywhere: Subject-Aware Video Background Generation
Paper • 2401.10822 • Published • 13