-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 40 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 20
Collections
Discover the best community collections!
Collections including paper arxiv:2406.19389
-
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 104 -
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Paper • 2404.12253 • Published • 54 -
Make Your LLM Fully Utilize the Context
Paper • 2404.16811 • Published • 52 -
ReFT: Representation Finetuning for Language Models
Paper • 2404.03592 • Published • 91
-
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper • 2410.13861 • Published • 53 -
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Paper • 2411.07975 • Published • 27 -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 71 -
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 43
-
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper • 2406.17294 • Published • 11 -
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 53 -
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 9 -
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Paper • 2407.02869 • Published • 18
-
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 53 -
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Paper • 2406.17557 • Published • 90 -
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
Paper • 2407.02485 • Published • 5 -
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Paper • 2407.01370 • Published • 86
-
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper • 2406.12275 • Published • 29 -
TroL: Traversal of Layers for Large Language and Vision Models
Paper • 2406.12246 • Published • 34 -
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper • 2406.15334 • Published • 9 -
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper • 2406.12742 • Published • 14
-
GLaMM: Pixel Grounding Large Multimodal Model
Paper • 2311.03356 • Published • 33 -
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Paper • 2311.07575 • Published • 13 -
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Paper • 2311.03354 • Published • 4 -
Language-Informed Visual Concept Learning
Paper • 2312.03587 • Published • 5
-
THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation
Paper • 2406.10996 • Published • 33 -
Simulating Classroom Education with LLM-Empowered Agents
Paper • 2406.19226 • Published • 31 -
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 53 -
LAMBDA: A Large Model Based Data Agent
Paper • 2407.17535 • Published • 35