-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 26 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2403.14624
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 180 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 50 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 41
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 52 -
AI4Math/MathVerse
Viewer • Updated • 4.73k • 1.85k • 48 -
CaraJ/MathVerse-lmmseval
Viewer • Updated • 8.67k • 1.13k • 2 -
CaraJ/Mathverse_VLMEvalKit
Viewer • Updated • 8.67k • 367 • 1
-
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 108 -
sDPO: Don't Use Your Data All at Once
Paper • 2403.19270 • Published • 41 -
ViTAR: Vision Transformer with Any Resolution
Paper • 2403.18361 • Published • 54 -
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 47
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 52 -
Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
Paper • 2312.17080 • Published • 1 -
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Paper • 2407.01284 • Published • 78 -
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Paper • 2411.00836 • Published • 15
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 52 -
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Paper • 2407.01284 • Published • 78 -
MAVIS: Mathematical Visual Instruction Tuning
Paper • 2407.08739 • Published • 33
-
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Paper • 2403.09029 • Published • 55 -
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper • 2403.12968 • Published • 25 -
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 70 -
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 77